Abstract


Arab writers have serious deficiencies in using punctuation marks. Linguists call
for Arabic punctuation to be revised and its rules better defined. Some believe
that punctuation should be anchored in grammatical rules, but what level of
competence must the language user possess to punctuate a piece of text
correctly? Can Machine Learning (ML) algorithms handle automatic punctuation
annotation of Arabic texts? Which machine learning algorithm is best suited for
this task? Could machine learning algorithms produce a model that can be used to
punctuate any Arabic text? In an attempt to answer these questions, three tagging
algorithms were experimented with here: Conditional Random Fields (CRF), N-gram,
and Hidden Markov Model (HMM). They were trained and tested on the Boundary
Annotated Qur'an (BAQ) Corpus after it was extended with Modern Standard Arabic
(MSA) style punctuation annotation. The three ML algorithms were tested on the
corpus under two conditions, coarse and fine annotation, using three
part-of-speech tags (3-POS): noun, verb, and particle, and ten part-of-speech
tags (10-POS): noun, nominal, verb, pronoun, etc., respectively. The results show
that machine learning algorithms can perform automatic punctuation and that
implementing the CRF model with coarse annotation can yield slightly better
punctuation performance than fine annotation can. This supports the conclusion
that the language user need not possess a high level of linguistic competence to
learn how to punctuate a piece of text correctly. The CRF model scored 93.4%
and 86.4% for accuracy and Balanced Accuracy Rate (BCR) respectively with the
3-POS tags, and 92.2% and 85.8% respectively with the 10-POS tags. Furthermore,
training the CRF tagger on Classical Arabic was found to benefit greatly its
prediction of punctuation marks in Modern Standard Arabic texts. For the
sentence terminal prediction task, the N-gram model performed best of the three
ML algorithms. It scored 91.8% and 70.0% for accuracy and BCR respectively with
the 3-POS tags, and 91.8% and 69.6% respectively with the 10-POS tags.


Chapter 1 
Introduction 
1.1 Thesis Overview 


This research aims to investigate and argue that automatic punctuation annotation,
including sentence terminal prediction, is possible using machine learning algorithms on
Classical Arabic and Modern Standard Arabic (MSA) alike. In addition, it aims to examine
which machine learning algorithm is best for this task. It also seeks to establish which
level of linguistic competence the Arabic language user needs in order to punctuate
Arabic text correctly. Furthermore, this study aims to develop an automatic system for
inserting punctuation marks and sentence terminals in the Holy Qur'an, and then to use
the knowledge gained there to punctuate any written MSA text. Note that this research
does not take the grammatical rules of Arabic into account in the experiments.


To achieve these goals, the usage of modern punctuation marks was investigated. Then,
machine learning algorithms were trained and tested on the Boundary Annotated Qur'an
(BAQ) Corpus (Sawalha, et al., 2012) after some modification. Modern punctuation
marks were extracted from Sayyid Qutb's في ظلال القران "fi zilal al-qur'an" (Qutb,
1991) and then inserted into the BAQ Corpus. The results of testing the different machine
learning algorithms were then studied and compared using standard evaluation metrics.
Finally, the best model, generated by a machine learning algorithm trained on the
Qur'an, was used to annotate MSA texts with modern punctuation marks.


1.2 Background on Arabic Punctuation 


Punctuation marks are special marks such as the comma ",", full stop ".", colon ":",
semicolon ";", question mark "?", and exclamation mark "!". They appear in texts to help
readers understand the contents correctly and to enhance the interpretation of texts.
They also define text divisions, i.e. clauses, sentences, and paragraphs. Punctuation
marks first appeared in the Greek language around 200 B.C. Modern English has a
well-defined set of punctuation marks, and other languages have also developed standards
for using them. Many studies have been conducted on automatically punctuating texts in
other languages such as Vietnamese, Chinese, and Japanese.


Adding punctuation marks to a text improves its readability (Beaglehole and Yates,
2010). In addition, punctuation marks provide rich information useful for Natural
Language Processing applications, such as syntactic analysis, segmentation,
part-of-speech tagging, machine translation, information extraction, parsing, and
phrasing.


1.3 Punctuation Annotation Challenges in Arabic 


In Arabic, readers emphasize the meaning of a text using pitch accent when reading it
aloud. Pitch-accented words help listeners understand the meaning of the text. In
silent reading, readers need punctuation marks to highlight the pitch-accented words so
that they can understand the text better (Gordon, 2014). Early on, Arabic linguists
realized the importance of introducing punctuation marks to Arabic and defining the
rules for using each mark. Punctuation marks were introduced to Arabic for the first
time in 1917 by Ahmad Zaki Basha (Zaki, 1930).


However, Arab readers and writers have serious deficiencies in using suitable
punctuation marks (Khafaji, 2001). Therefore, punctuation marks for Arabic need to be
revised and their usage rules better defined.


1.4 Using Punctuation Marks in the Holy Qur'an 


The Holy Qur'an is the central religious text of Islam and the original text of the final
revelation of Allah. It is an excellent source of ideas and a fertile ground for Islamic
studies and for the development of Arabic linguistic theory. The Qur'an is considered a
perfect gold standard text for developing, modeling, and evaluating Arabic NLP
applications. As a corpus, the Qur'an consists of 114 chapters, which are made up of
6,243 verses and 77,430 words according to Sawalha, et al. (2012). Adding punctuation
marks to the Qur'anic text can help readers decompose longer verses into shorter
sentences whose meaning is easier to understand. For example, inserting quotation marks
indicates to the reader that a passage is direct speech. The Qur'an already has prosodic
(stop and start) marks that scholars inserted into the text. These marks help the
reciter while reading the text of the Qur'an. The locations of the pause and start
prosodic marks are determined by meaning. These marks are discussed further in
Section 2.1.2. Table 1.4.1 shows two examples of punctuated Qur'anic text. The first
example covers verses 1 to 3 of Surat Al-Baqarah. The second shows verses 12 and 13 of
Surat Al-Imran.


Table 1.4.1: Examples of punctuated text from the Qur'an

Verses | Saheeh International translation | Punctuated verses according to Sayyid Qutb | Not punctuated
Surat Al-Baqarah (1-3) | Alif, Lam, Meem. This is the Book about which there is no doubt, guidance for those conscious of Allah. Who believe in the unseen, establish prayer, and spend out of what We have provided for them. | [Arabic text not recoverable] | [Arabic text not recoverable]
Surat Al-Imran (12-13) | Say to those who disbelieve, "You will be overcome and gathered together to Hell, and wretched is the resting place." Already there has been for you a sign in the two armies which met - one fighting in the cause of Allah and another of disbelievers. They saw them [to be] twice their [own] number by [their] eyesight. But Allah supports with His victory whom He wills. Indeed in that is a lesson for those of vision. | [Arabic text not recoverable] | [Arabic text not recoverable]


1.5 Research Motivation 


Since punctuation marks are helpful for better understanding of any text (Gordon,
2014), this research aims to develop an automatic punctuation system that can insert
punctuation marks into the Holy Qur'an using ML algorithms. The knowledge gained
from this punctuation annotation would then be used to punctuate MSA text.


A new version of the BAQ Corpus annotated with punctuation marks will make a useful
gold standard for machine learning algorithms. Machine learning algorithms applied to
the BAQ Corpus will learn patterns (transferable knowledge) and then use these patterns
to help in automatically punctuating MSA texts.


This study will help Muslims around the world to read and understand the Qur'an.
It is a multidisciplinary study that contributes not only to Information Technology,
including Computational Linguistics, Language Engineering, and Machine Learning, but
also to other disciplines such as Islamic Studies, Linguistics, and Arabic language
learning. The transferable knowledge of punctuation annotation would be useful for
many NLP applications, such as POS tagging, named entity recognition, phrasing, and
segmentation.


1.6 Research Aim and Objectives 


The main aim of this research is to develop an automatic punctuation annotation system,
including sentence terminal prediction, for Qur'anic text using ML algorithms. The
transferable knowledge of punctuation annotation from the BAQ Corpus would then be
used to punctuate any Arabic text.


To achieve this goal, we will study modern punctuation marks, annotate the BAQ
Corpus with the modern punctuation marks used in Sayyid Qutb's explanation
في ظلال القران "fi zilal al-qur'an" (Qutb, 1991), train and test ML algorithms on the BAQ
Corpus, compare the results produced, and finally use the best BAQ-trained ML
algorithm to punctuate MSA texts. The BAQ Corpus consists of 77,430 words together
with their linguistic features, such as part-of-speech tags (noun, verb, pronoun, etc.).
It is structured into 8,230 sentences of the Holy Qur'an.

The main objectives of this study are as follows:


1. Surveying the research on punctuation marks prediction. 

2. Modeling punctuation marks prediction and sentence terminal prediction on the 
Qur'anic text using ML algorithms. 

3. Comparing the results offered by the ML algorithms. 

4. Using the ML algorithm that performed best on the BAQ Corpus to punctuate a 
piece of MSA text with modern punctuation marks. 


5. Evaluating the obtained results.


1.7 Research Methodology


The methodology that we will use for building an automatic punctuation annotation
system relies on ML algorithms. We will extract modern punctuation marks from
"fi zilal al-qur'an" (Qutb, 1991) and then insert them into the BAQ Corpus. The BAQ
Corpus will then be split into training and testing datasets, and ML algorithms will be
trained and tested on them. The results obtained will be compared, and the best ML
model will be used to punctuate MSA texts after training on the BAQ Corpus. We will
discuss this methodology in detail in Section 3.1.
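

The annotation and splitting steps described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the implementation used in this thesis: it assumes the punctuated text is already tokenized, and the label names (COMMA, FULLSTOP, and so on), the function names, and the 80/20 split ratio are all hypothetical.

```python
# Hypothetical mapping from punctuation marks to label names (an assumption;
# the tag set actually used in the thesis is not shown in this chapter).
PUNCT_LABELS = {
    ",": "COMMA", "،": "COMMA",
    ";": "SEMICOLON", "؛": "SEMICOLON",
    "?": "QUESTION", "؟": "QUESTION",
    ".": "FULLSTOP", "!": "EXCLAMATION", ":": "COLON",
}


def label_tokens(tokens):
    """Turn a punctuated token list into (word, label) pairs.

    Each word is labelled with the mark that follows it, or NONE.
    """
    pairs = []
    for tok in tokens:
        if tok in PUNCT_LABELS:
            if pairs:  # attach the mark to the preceding word
                word, _ = pairs[-1]
                pairs[-1] = (word, PUNCT_LABELS[tok])
        else:
            pairs.append((tok, "NONE"))
    return pairs


def split_corpus(sentences, train_fraction=0.8):
    """Split a list of sentences into training and testing parts.

    The 0.8 default is an illustrative choice, not a value from the thesis.
    """
    cut = int(len(sentences) * train_fraction)
    return sentences[:cut], sentences[cut:]


example = ["this", "is", "the", "book", ",", "a", "guidance", "."]
labelled = label_tokens(example)
train_part, test_part = split_corpus(list(range(10)))
```

Labelling each word with the mark that follows it turns punctuation insertion into a per-word tagging task, which is the form that tagging algorithms such as CRF, N-gram, and HMM taggers expect.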


1.8 Thesis Structure 


This thesis is structured as follows: 

1. Chapter one: presents an overview of the thesis. It defines the main aims and
objectives of this research, gives a brief introduction to modern punctuation
marks, and identifies punctuation challenges. In addition, prosodic
punctuation marks in the Holy Qur'an are discussed. Finally, the research
motivation and methodology are delineated.

2. Chapter two: describes modern punctuation marks and their usage in Arabic
texts. It reviews the prosodic (start and stop) marks and their usage in the
Qur'an, discusses the usage of punctuation marks by Sayyid Qutb, compares the
Makki and Madani chapters, and reviews the literature on punctuation
annotation.

3. Chapter three: presents our methodology for building an automatic punctuation
system for the Qur'an and MSA text. In addition, it discusses the methods and
tools used for implementing the methodology, i.e. NLTK, the BAQ corpus, and ML
algorithms.

4. Chapter four: presents the design and the plan for implementing the
experiments and states the problem of punctuation and sentence terminal
prediction.

5. Chapter five: discusses the evaluation metrics used to measure the performance
of the ML algorithms and the problem of skewed data. In addition, it presents and
discusses the results of applying the ML algorithms to the BAQ Corpus and MSA
text for punctuation and sentence terminal prediction.


6. Chapter six: summarizes the conclusions of this research and the future work.


Chapter 2 


Literature Review 


Several studies have discussed punctuation annotation and sentence terminal
prediction, because these topics are important in many Natural Language Processing
applications, such as information extraction, segmentation, POS tagging, and automatic
speech recognition systems.


Studies conducted for different languages have used machine learning algorithms for
punctuation and sentence terminal prediction. One of the first was Beeferman, et al.
(1998) for English. For Arabic, many studies have addressed sentence terminal
detection (Sawalha, et al., 2012) but not punctuation annotation.


This chapter addresses several topics: the different punctuation marks and their usage
in Arabic, the prosodic marks and their usage in the Qur'an, the difference between the
Madani and Makki chapters of the Qur'an, and the usage of punctuation marks in the
explanation of Sayyid Qutb. In addition, it surveys the research on punctuation
annotation and sentence terminal prediction carried out with different machine learning
algorithms.


2.1 Punctuation Marks Overview 


2.1.1 Punctuation Marks for Arabic 


Readers need special signs and symbols to help them understand and make sense of
written texts. Other nations realized the importance of this early and were the first
to use special symbols that clarify written texts by separating sentences, enabling the
reader to vary his tone while reading and to distinguish different sentence types such
as stops, starts, and questions (Zaki, 1930).


The Arabic language suffered from a shortage in this area, as readers found it difficult
to understand the meanings and purposes of written Arabic texts in the absence of such
signs and symbols. Hence the need for special marks became apparent, and Arabic
language scholars set out to introduce some of them. One of these scholars was Ahmad
Zaki Basha, who combined the punctuation marks used in foreign languages and modified
them according to the rules of the Arabic language. These marks are: comma "،"
(الفاصلة), semicolon "؛" (الفاصلة المنقوطة), full stop "." (النقطة), question mark "؟"
(علامة الإستفهام), exclamation mark "!" (علامة التعجب), colon ":" (النقطتان الرأسيتان),
ellipsis mark "..." (علامة الحذف), hyphen mark "-", quotation mark "(( ))"
(علامة التنصيص), and parentheses mark "( )" (علامة القوسان). Table 2.1.1.1 presents
these punctuation marks with examples.
Table 2.1.1.1: Types of punctuation marks and their usage in the Arabic language.

# | Punctuation mark | Usage | Example
1 | Comma "،" | Signals a short pause for the reader, separating one sentence from another. It is also used between the clauses of a compound sentence. | This is the Book about which there is no doubt, a guidance for those conscious of Allah. (Al-Baqarah: 2)
2 | Semicolon "؛" | Signals a pause longer than that of a comma. It is used between long sentences that together form a full meaning, or between two sentences where the second is the cause of the first. | It is He who created for you all of that which is on the earth; then He directed Himself to the heaven [His being above all creation] and made them seven heavens; and He is Knowing of all things. (Al-Baqarah: 29)
3 | Full stop "." | Used at the end of a sentence with a full meaning. | [All] praise is [due] to Allah, Lord of the worlds. (Al-Fatiha: 2)
4 | Question mark "؟" | Used at the end of an interrogative sentence. | Has there reached you the report of the Overwhelming [event]? (Al-Ghashiya: 1)
5 | Exclamation mark "!" | Used at the end of sentences that express excitement, surprise, astonishment, or any strong emotion. | And what can make you know what is [breaking through] the difficult pass! (Al-Balad: 12)
6 | Colon ":" | Used to introduce reported speech, or to explain, prove, define, or list elements. | And the messenger of Allah [Salih] said to them: "[Do not harm] the she-camel of Allah or [prevent her from] her drink." (Ash-Shams: 13)
7 | Ellipsis mark "..." | Used in place of omitted speech. | And if someone is in hardship, then [let there be] postponement until [a time of] ease. But if you give [from your right as] charity, then it is better for you... if you only knew. (Al-Baqarah: 280)
8 | Hyphen mark "-" | Used between two sentences if the first sentence is too long, and between a number and the item it numbers. | Say, "Shall I inform you of [something] better than that? For those who fear Allah will be gardens in the presence of their Lord beneath which rivers flow - wherein they abide eternally - and purified spouses and approval from Allah. And Allah is Seeing of [His] servants - (Al-Imran: 15)
9 | Quotation mark "(( ))" or " " | Used to mark direct speech, a quotation, or a phrase. | قال لي: ((لن أذهب معكم في الحافلة)). He said to me: "I will not go with you in the bus".
10 | Parentheses mark "( )" | Used to enclose speech that is not related to the sentence. | عمان (عاصمة الأردن) أكثر مدن الأردن تعداداً للسكان. Amman (Jordan's capital) is the most populous city in Jordan.


2.1.2 Prosodic Marks (Starts and Stops) in the Qur'an


Qur'an punctuation was proposed by Arab linguists in the 16th century, who worked to
produce a distinctive punctuation system to facilitate the recitation of the Qur'an.
These stop and start marks are placed in the middle of the verses, at the end of the
word they refer to. Table 2.1.2.1 defines the different types of start and stop signs.


The reciter of the Holy Qur'an must have good knowledge of the science of pauses and
starts (Al Waqf and Al Ibtida'), which is closely associated with the syntax and the
meanings of the verses of the Holy Qur'an. The science of prosodic marks matters to the
reciter and the listener of the Qur'an in two main ways:


1. It clarifies the meanings of the verses to the listener whenever the reciter knows
where he has to pause and start.
2. The different degrees of pause are closely related to the understanding of the
Holy Qur'an and the amount of knowledge of it.


Ibn Al Jazari, one of the greatest Muslim scholars, said: "Many earlier Muslim imams
stipulated that no one can be granted a license in the Holy Qur'an unless he has good
knowledge of the science of starts and stops".


The following verse is an example of the relation between the science of pauses and
starts and the meanings of the verses in the Holy Qur'an:


قال فإنها محرمة عليهم ۛ أربعين سنة ۛ يتيهون في الأرض ۚ فلا تأس على القوم الفاسقين (المائدة: 26)

[Allah] said, "Then indeed, it is forbidden to them for forty years [in which] they will
wander throughout the land. So do not grieve over the defiantly disobedient
people." (26: Al-Ma'ida).


The verse above has two interchangeable pause marks (ۛ), which means that the reciter
may stop at either one of them, and the choice affects the meaning of the verse. If the
reciter stops at the first site, the meaning is that the land is forbidden to the people
(the Jews) forever and they will wander throughout the land for forty years. If the
reciter stops at the second site, the meaning is that it is forbidden to them for forty
years and they will wander throughout the land (Shaker, Youssef et al., 2012).


Table 2.1.2.1: Pause and start signs.

Sign | Pause type | Definition | Example
م | Compulsory | The reciter must pause. | Allah has certainly heard the statement of those [Jews] who said, "Indeed, Allah is poor, while we are rich." We will record what they said and their killing of the prophets without right and will say, "Taste the punishment of the Burning Fire." (181: Al-i-Imran)
ج | Permissible | The reciter may pause or continue. | Our Lord, indeed we have heard a caller calling to faith, [saying], 'Believe in your Lord,' and we have believed. Our Lord, so forgive us our sins and remove from us our misdeeds and cause us to die with the righteous. (193: Al-i-Imran)
صلي | Continuation preferred | The reciter is allowed to pause, but continuing is preferable. | Let them eat and enjoy themselves and be diverted by [false] hope, for they are going to know. (3: Al-Hijr)
قلي | Pause preferred | The reciter is allowed to continue, but pausing is preferable. | Until, when he reached the setting of the sun, he found it [as if] setting in a spring of dark mud, and he found near it a people. Allah said, "O Dhul-Qarnayn, either you punish [them] or else adopt among them [a way of] goodness." (86: Al-Kahf)
لا | Not permissible | The reciter must not pause. | And when it is said to them, "What has your Lord sent down?" They say, "Legends of the former peoples," (24: An-Nahl)
ۛ ۛ | Interchangeable | These marks are placed at two sites. The reciter can pause at either of them but not at both. | This is the Book about which there is no doubt, a guidance for those conscious of Allah - (2: Al-Baqara)


2.1.3 Sayyid Qutb and the Usage of Punctuation Marks


Because the science of pauses and starts (Al Waqf and Al Ibtida') is somewhat hard to
learn and become familiar with, there is an urgent need for an interpretation of the
Holy Qur'an that includes an explicit form of pause and start. One of the greatest
books of interpretation of the Holy Qur'an is fi zilal al-qur'an (Qutb, 1991), whose
author interpreted the Holy Qur'an in a distinctive manner using Arabic punctuation
marks.


Sayyid Qutb was one of the greatest jurists, thinkers, and interpreters of the Holy
Qur'an. His interpretation and his use of punctuation marks were based on his
understanding of the Holy Qur'an, the science of the Hadith, and his knowledge of other
books of Qur'anic explanation (the explanation of Ibn Katheer). Sayyid Qutb used a set
of punctuation marks in his interpretation (fi zilal al-qur'an) of the Holy Qur'an.
These marks are: comma "،", full stop ".", question mark "؟", exclamation mark "!",
semicolon "؛", colon ":", hyphen mark "-", and ellipsis mark "..." (Alkhaldi, 2000).


The punctuation marks in fi zilal al-qur'an are used to replace the original pause and
start marks (Al Waqf and Al Ibtida' marks) and as a tool to explain the meanings of the
speech of Allah. The book is one of the best books of Qur'anic explanation and can be
used by many researchers in Islamic Shari'a. Sayyid Qutb did not discuss explicitly his
usage of punctuation marks in his explanation (Alkhaldi, 2000).


2.1.4 Makki and Madani Chapters of the Holy Qur'an 


The Holy Qur'an was revealed over twenty-three years from the beginning of the
preaching of Islam. The first thirteen years were in Makkah, and the chapters revealed
there are called Makki chapters. The Prophet Mohammad then migrated to Madinah, where
the rest of the chapters of the Qur'an were revealed; these are called Madani chapters.
The Makki and Madani chapters have many substantive differences: the Makki chapters
concentrate on the worship of one God alone and on rejecting the religion previously
dominant in Makkah, while the Madani chapters begin to set out the laws of Islam that
regulate people's lives. Makki and Madani chapters have many other characteristics;
Table 2.1.4.1 describes some of the key differences between them.


All these important differences between the chapters of the Qur'an led us to split the
BAQ corpus into two datasets and to ensure that the two datasets are rearranged and
interchanged repeatedly in the cross-validation experiments, which guarantees a fair
experiment over all the verses of the Qur'an.
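

The idea of interchanging two datasets between the training and testing roles can be sketched as follows. This is a minimal sketch under stated assumptions: the exact fold construction used in the experiments is not specified in this section, so the code simply swaps two parts, and the "model" is a toy majority-label predictor. All function names and the sample data are hypothetical.

```python
from collections import Counter


def swap_rounds(part_a, part_b):
    """Yield (train, test) pairs so each part is used once in each role."""
    yield part_a, part_b
    yield part_b, part_a


def majority_label(train):
    """Toy 'model': the most frequent punctuation label in the training part."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def accuracy(predicted_label, test):
    """Share of test items whose gold label equals the predicted label."""
    return sum(label == predicted_label for _, label in test) / len(test)


def cross_validate(part_a, part_b):
    """Return the mean score and the per-round scores over both rounds."""
    scores = [
        accuracy(majority_label(train), test)
        for train, test in swap_rounds(part_a, part_b)
    ]
    return sum(scores) / len(scores), scores


# Illustrative (POS tag, punctuation label) pairs, not taken from the BAQ Corpus.
part_a = [("noun", "NONE"), ("verb", "NONE"), ("noun", "COMMA")]
part_b = [("noun", "COMMA"), ("verb", "COMMA")]
mean_score, round_scores = cross_validate(part_a, part_b)
```

Averaging over both rounds means that every item is tested exactly once, so no part of the data is evaluated only by a model that has already seen it.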


Table 2.1.4.1: Differences between Makki and Madani chapters.

Makki chapters | Madani chapters
Chapters that have verses that command Muslims to prostrate to Allah. | Enjoin jihad and mention its rules.
Chapters with verses that contain the word "Kalla" (كلا) in the second half of the Qur'an. | Mention the details of the Islamic legal system that governs people in the community.
Short verses. | Long verses.
Hard oratorical style. | Easy vocabulary.
Arguments with the polytheists and objection to their association of partners with Allah. | Argument with the People of the Book, i.e. Jews and Christians.
Chapters with verses that have the phrase (يا أيها الناس) "O Mankind". | Chapters with verses that start with (يا أيها الذين امنوا) "O you who believe".


2.2 Punctuation Marks Annotation for Foreign Languages


2.2.1 Punctuation Annotation as a Classification Problem 


Data analysis is concerned with extracting models that describe classes of data or
predict continuous values. Classification models predict categorical class labels, that
is, they assign class labels to a set of unclassified cases (David and Balakrishnan,
2010). As an example of classification, a bank loan officer may want to analyze
customers' data to determine which of them deserve a loan.


Classification modeling has two phases:


1- Building the classifier model.

2- Using the classifier model for classification.

In the first phase, the classifier is built using the classification algorithm and a set
of training data consisting of tuples, where each tuple is associated with a class label.
This phase, called the training phase, produces a set of classification rules. In the
second phase, the classification rules are used to assign class labels to a set of test
data that lacks these labels. The accuracy of the classification model is measured in
this phase.
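

The two phases can be illustrated with a deliberately simple classifier. The sketch below is not one of the algorithms evaluated in this thesis: it learns one rule per feature value (the most frequent label seen with it) and then applies those rules to unlabeled test data. The function names and toy data are hypothetical, and `balanced_rate` assumes that a balanced rate is the mean of per-class recall, which is the usual definition of balanced accuracy and may differ from the definition used later in the thesis.

```python
from collections import Counter, defaultdict


def build_classifier(training_tuples):
    """Phase 1: learn, for each feature value, its most frequent class label."""
    counts = defaultdict(Counter)
    overall = Counter()
    for feature, label in training_tuples:
        counts[feature][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen features
    rules = {f: c.most_common(1)[0][0] for f, c in counts.items()}
    return rules, default


def classify(model, features):
    """Phase 2: assign a class label to each unlabeled test item."""
    rules, default = model
    return [rules.get(f, default) for f in features]


def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def balanced_rate(gold, predicted):
    """Mean of per-class recall (assumed definition, see the lead-in)."""
    recalls = []
    for label in sorted(set(gold)):
        idx = [i for i, g in enumerate(gold) if g == label]
        recalls.append(sum(predicted[i] == label for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Toy (POS tag, punctuation label) tuples, for illustration only.
train = (
    [("noun", "NONE")] * 3
    + [("verb", "NONE")] * 2
    + [("verb", "FULLSTOP")]
    + [("pronoun", "COMMA")] * 2
)
model = build_classifier(train)
test_features = ["noun", "verb", "pronoun", "particle"]
test_gold = ["NONE", "FULLSTOP", "COMMA", "NONE"]
predicted = classify(model, test_features)
```

On this toy data the accuracy is higher than the balanced rate, because the frequent NONE class dominates the plain accuracy while the missed FULLSTOP item counts as a whole class in the balanced rate.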


Prediction of punctuation marks is one application of classification. A predefined
training set must exist, consisting of tuples of words, POS tags, and punctuation tags.
These tuples provide the features that are used to solve the punctuation annotation
problem.
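

As an illustration of how such tuples can be turned into features, the sketch below builds a small feature dictionary for each word from its own word form and POS tag and the POS tags of its neighbours. The window size, the feature names, the `<PAD>` placeholder, and the transliterated sample are all assumptions made for this example; they are not the feature set used in the experiments.

```python
def window_features(tagged_words, i):
    """Features for position i from the current, previous and next items.

    tagged_words is a list of (word, pos) pairs.
    """
    def at(j, field):
        if 0 <= j < len(tagged_words):
            return tagged_words[j][field]
        return "<PAD>"  # placeholder outside the sentence

    return {
        "word": at(i, 0),
        "pos": at(i, 1),
        "prev_pos": at(i - 1, 1),
        "next_pos": at(i + 1, 1),
    }


def to_training_tuples(annotated):
    """Convert (word, pos, punct_label) tuples into (features, label) pairs."""
    tagged = [(word, pos) for word, pos, _ in annotated]
    return [
        (window_features(tagged, i), label)
        for i, (_, _, label) in enumerate(annotated)
    ]


# Transliterated sample with coarse POS tags; tags and labels are illustrative.
sample = [
    ("dhalika", "noun", "NONE"),
    ("al-kitabu", "noun", "NONE"),
    ("la", "particle", "NONE"),
    ("rayba", "noun", "NONE"),
    ("fihi", "particle", "COMMA"),
]
pairs = to_training_tuples(sample)
```

Each pair couples the observable context of a word with the punctuation tag that follows it, which is the input form a sequence tagger is trained on.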


Many machine learning algorithms, such as the Hidden Markov Model (HMM), the N-gram
model, the TnT model, and Conditional Random Fields (CRF), are used to solve problems
in Natural Language Processing such as information retrieval, speech recognition,
machine translation, and intelligent character recognition. Some of these problems are
classification problems, among them punctuation mark annotation and sentence terminal
prediction.
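

To make the HMM option concrete, the following sketch trains a first-order HMM by counting and decodes the most probable label sequence with the Viterbi algorithm. It is a minimal sketch under stated assumptions rather than the tagger used in the experiments: the hidden states are punctuation labels, the observations are POS tags, add-one smoothing is an arbitrary choice, and all names and the toy training sequence are hypothetical.

```python
import math
from collections import Counter

START = "<s>"  # artificial start state


def train_hmm(sequences):
    """Count transitions and emissions from [(observation, state), ...] lists."""
    trans, emit = Counter(), Counter()
    from_count, state_count = Counter(), Counter()
    vocab = set()
    for seq in sequences:
        prev = START
        for obs, state in seq:
            trans[(prev, state)] += 1
            from_count[prev] += 1
            emit[(state, obs)] += 1
            state_count[state] += 1
            vocab.add(obs)
            prev = state
    return {"trans": trans, "from": from_count, "emit": emit,
            "count": state_count, "states": sorted(state_count), "vocab": vocab}


def viterbi(model, observations):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    states = model["states"]
    n_states = len(states)
    n_obs = len(model["vocab"]) + 1  # one extra slot for unseen observations

    def t(prev, s):  # smoothed log transition probability
        return math.log((model["trans"][(prev, s)] + 1) / (model["from"][prev] + n_states))

    def e(s, o):  # smoothed log emission probability
        return math.log((model["emit"][(s, o)] + 1) / (model["count"][s] + n_obs))

    best = {s: t(START, s) + e(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[p] + t(p, s))
            new_best[s] = best[prev] + t(prev, s) + e(s, obs)
            pointers[s] = prev
        best = new_best
        back.append(pointers)
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):  # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training data: (POS tag, punctuation label) pairs, for illustration only.
toy = [[("noun", "NONE"), ("verb", "NONE"), ("noun", "COMMA"),
        ("particle", "NONE"), ("noun", "FULLSTOP")]]
toy_model = train_hmm(toy)
toy_path = viterbi(toy_model, ["verb", "noun", "particle", "noun"])
```

Working in log space avoids numerical underflow on long sequences, and the back-pointers let the best path be recovered in a single backward pass.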


2.2.2 Related Works 


One of the first systems, although not the first, for predicting punctuation marks in
speech text using an ML algorithm is the system proposed by Beeferman, Berger et al.
(1998) for the purpose of punctuation restoration. It uses a trigram language model
together with the Viterbi algorithm. The system relied only on lexical information
from the Penn Treebank corpus for English and ignored other linguistic features such as
part-of-speech tags. Since such features can help disambiguate word categories,
ignoring them can be considered one of the disadvantages of the system. Another
weakness is its reliance on the trigram model, which is poor at capturing long-range
dependencies between the words of a sentence. Nevertheless, the experiments on the
proposed system showed good results for sentence accuracy (about 54.0%), a measure
considered more indicative of real-world usefulness, and the system also achieved high
precision and recall scores.


Another applied model for punctuation annotation is the CRF model that has a 
reliability for labeling sequential data due its functioning in predicting punctuation 
marks in many different languages (Lafferty, et al., 2001), where it takes into account 
the context of the observation and its ability to recept multi-layer of functions for taking 
observations. (Lu and Ng, 2010), proposed a model of a multifunction task: sentence 
boundary, sentence type prediction, and punctuation prediction for a speech utterance. 


The proposed approach has been applied in English and Chinese languages, using 
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International Workshop on Spoken Language Translation (IWLST) corpus (Paul, et al., 
2010) which consists of conversational speech texts, where short and more question 
sentences appeared compared with other corpora. 

The experiments showed that punctuation annotation that depends on both the word level and
the sentence level gives better results than annotation that depends only on a set of
contiguous observations (words). Accordingly, linear-chain conditional random fields
(Lafferty, et al., 2001), which label a single sequence of words, are less suitable for
this task. On this basis, the authors proposed using a Factorial CRF (F-CRF) (Sutton and
McCallum, 2006) to perform word-level labeling together with sentence boundary detection
and sentence type prediction. The word layer is used for annotating punctuation symbols
(e.g. NONE, COMMA, PERIOD), and the sentence layer is used for marking both sentence
boundaries and sentence type (Declarative, Question, Exclamatory) with the tags DEBEG,
DEIN, QNBEG, QNIN, EXBEG and EXIN; for example, DEBEG marks the beginning of a declarative
sentence. The GRMM package (Sutton and McCallum, 2006) was used for building and training
the linear-chain CRF and the Factorial CRF models. The experiments confirmed that the
Factorial CRF model has advantages over the linear-chain CRF model because of its ability
to handle multiple layers of sequence tags, which can be used simultaneously to determine
the sentence boundaries and the tags of the observations (words). The F-CRF model brought
larger improvements for English than for Chinese because of the linguistic diversity of
Chinese; for example, the indicative words of a question can be located anywhere within
the question sentence.
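
To make the two label layers concrete, the following is a minimal Python sketch that derives
both layers from already punctuated, tokenized text. It only illustrates the tag scheme
described above and is not the F-CRF model itself. The function name `two_layer_labels`, the
tag names QMARK and EMARK, and the reading of the "IN" tags as "inside the sentence" are
assumptions made for this example.

```python
# Illustrative only: builds the word-layer and sentence-layer tags described
# above from punctuated text. It does not train or apply any CRF.

# Assumed mapping from punctuation symbols to word-layer tags.
WORD_TAGS = {",": "COMMA", ".": "PERIOD", "?": "QMARK", "!": "EMARK"}
# Assumed mapping from sentence-final marks to sentence-type prefixes.
SENT_TYPES = {".": "DE", "?": "QN", "!": "EX"}


def two_layer_labels(tokens):
    """Return (words, word_layer, sentence_layer) for a token list in which
    each punctuation mark is a separate token placed after the word it follows."""
    words, word_layer, sentence_layer = [], [], []
    start = 0  # index in `words` where the current sentence begins
    for tok in tokens:
        if tok in WORD_TAGS:
            if not words:
                continue  # stray leading punctuation: nothing to attach it to
            word_layer[-1] = WORD_TAGS[tok]
            if tok in SENT_TYPES:
                prefix = SENT_TYPES[tok]
                for i in range(start, len(words)):
                    sentence_layer[i] = prefix + ("BEG" if i == start else "IN")
                start = len(words)
        else:
            words.append(tok)
            word_layer.append("NONE")
            sentence_layer.append(None)  # filled in when the sentence closes
    return words, word_layer, sentence_layer


if __name__ == "__main__":
    for row in zip(*two_layer_labels("how are you ? fine , thanks .".split())):
        print(row)
```

Words that are never followed by a sentence-final mark keep `None` in the sentence layer,
which makes unterminated input easy to spot.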

Another study on predicting punctuation marks for Chinese is (Zhao, et al., 2012), which
proposed a method based on the Conditional Random Field (CRF) model of (Lafferty, et al.,
2001). They framed the task of punctuation prediction around linguistic features extracted
at three levels: word, phrase, and function (a combination of the current, the preceding
and the following words and their corresponding phrase tags). They assumed that using more
linguistic features would help the system outperform systems that use fewer features, and
CRF sequence labeling was adopted as the framework of the proposed system. The overall
methodology uses already punctuated Chinese text from the Tsinghua Chinese Treebank corpus
(Qiang, 2004), which is annotated with many Chinese linguistic features, to train the
proposed model. They removed all punctuation marks and then created three levels of
linguistic information (word POS, phrase and functional chunks) to restore the punctuation
marks using the CRF-based sequence labeling method. Several experiments were conducted
using mixtures of features from the extracted levels and sentences of different lengths.
The experiments showed that combining phrase-level and functional-level features gave the
best results for inserting punctuation marks into Chinese text, compared with using
phrase-level or functional-level features alone. The proposed system takes a discriminative
approach to annotating punctuation marks using only lexical information, and it is able to
work at the phrase level, which is considered one of the most helpful features for
annotating punctuation marks in texts.


(Pham, et al., 2014) investigated the prediction of punctuation marks for Vietnamese. They
proposed a system for punctuation annotation using the linear-chain Conditional Random
Fields (CRF) model of (Lafferty, et al., 2001). Because Vietnamese had no corpus prepared
for the task of punctuation mark prediction, a corpus had to be built for this purpose.
Accordingly, text from online news journals and papers was adopted after some
preprocessing steps, such as deleting redundant data and converting digits to written
numbers, to produce a corpus of 240,000 words and a set of seven punctuation marks: comma
(,), period (.), colon (:), semicolon (;), ellipsis (...), question mark (?) and
exclamation mark (!). Two tagging schemes were proposed (a concise and an expanded tag
set). The first scheme (concise) labels each non-punctuated word with O and each
punctuated word with its punctuation mark (comma, period, etc.). The second scheme builds
on the observation that the current punctuation mark is related to the previous one.
Accordingly, each non-punctuated word is labeled with O together with the type of the
previous punctuation mark (e.g. O/comma), while each punctuated word is labeled with the
type of its punctuation mark. Based on these two schemes, a set of features was proposed
and trained with the CRF++ toolkit (Kudo, 2005). The best set of features was selected to
train and test the proposed system. A group of experiments was conducted on the two
labeling schemes using the selected features, and the results were compared with each
other. They concluded that the best set of features with the expanded scheme gives the
best F1 score (52.89%), which they attribute to the strong dependency between the current
punctuation mark and the previous one. The proposed system achieved good results despite
the shortage of training data. In addition, the system does not depend on any other
linguistic feature such as POS tags, which could be considered a disadvantage of the
system.
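
The difference between the concise and the expanded tagging schemes can be sketched in a few
lines of Python. This is an illustration of the labeling idea only, not the authors' code:
the function names, the input format (a list of word/mark pairs) and the `initial` value used
before the first punctuation mark has been seen are assumptions of this example.

```python
def concise_labels(tagged_words):
    """tagged_words: list of (word, mark) pairs, where mark is the name of the
    punctuation mark that follows the word, or None if no mark follows."""
    return [mark if mark else "O" for _, mark in tagged_words]


def expanded_labels(tagged_words, initial="period"):
    """Like concise_labels, but every non-punctuated word also carries the type
    of the most recent punctuation mark (e.g. "O/comma").

    `initial` is the assumed 'previous mark' before the first word."""
    labels, previous = [], initial
    for _, mark in tagged_words:
        if mark:
            labels.append(mark)
            previous = mark
        else:
            labels.append(f"O/{previous}")
    return labels


if __name__ == "__main__":
    sample = [("first", None), ("yes", "comma"), ("we", None), ("can", "period")]
    print(concise_labels(sample))
    print(expanded_labels(sample))
```

In the expanded scheme the label of a plain word changes with its left context, which is how
the dependency between consecutive punctuation marks becomes visible to the tagger.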


Punctuation annotation also has an important application in Automatic Speech Recognition
(ASR) systems, which are responsible for converting spoken words into written text in real
time (Stuckless, 1994). Such systems cannot automatically insert punctuation marks into the
transcribed text, and this directly affects the information extraction process, which
comprises many feature extraction steps such as Part-Of-Speech tagging and named entity
tagging. (Hillard, et al., 2006) investigated the impact of automatic comma annotation on
Part-Of-Speech tagging and named entity tagging in an ASR system for Mandarin. The proposed
system consists of several phases: automatic speech recognition, punctuation annotation,
Part-Of-Speech tagging, and finally name tagging. Various techniques and taggers were used
for these phases: the SRI Decipher recognizer (Stolcke, et al., 2006) was used for speech
recognition; comma annotation and sentence boundary detection were implemented with the
ICSI+ multilingual sentence segmentation system tools (Zimmerman, et al., 2006); two
taggers, Viterbi and N-gram, were used for Part-Of-Speech tagging; and name tagging was
performed with an HMM tagger. The training corpora were built mainly from transcribed
Mandarin news and Chinese textbooks, supplemented by other resources. The proposed system
supported the hypothesis that comma prediction improves POS and name tagging.


Another investigation, which illustrates the importance of sentence boundary detection and
punctuation annotation for translation quality in an Automatic Speech Recognition (ASR)
system, was proposed by (Matusov, et al., 2006). The proposed model translates the output
text of the ASR (a long sequence of words) from the source language to the target language
after two consecutive steps: sentence boundary detection (segmentation) and punctuation
prediction. Three strategies were investigated for this task. The first strategy detects
sentence boundaries and then translates each segmented sentence into the target language.
The produced sentence is then tagged with punctuation marks, given that punctuation marks
were removed from both the source and the target languages. One drawback of this strategy
is that punctuation annotation has to rely on lexical information only. Another drawback is
that punctuation annotation is based on a translation that will not be free of mistakes,
which can increase the errors of punctuation prediction. The second strategy consists of
two stages: the first stage detects sentence boundaries, and the second integrates
translation with the prediction of punctuation marks for the target language. Here
punctuation marks are removed only from the source side of the training corpus;
consequently, the translation method produces two types of phrases, one without
punctuation marks and one with punctuation marks. A log-linear model was used to select
between these two types of phrases. The third strategy integrates sentence boundary
detection and punctuation prediction in the source language and then translates the output
into the target language, keeping the punctuation marks in both the source and the target
training corpora. One drawback of this strategy is that the punctuation rules of the two
languages may differ. After analyzing the results of the three strategies, the authors
found the second to be the most appropriate for the proposed system. The quality of
sentence segmentation was measured using Precision and Recall on the TC-STAR task
(English-to-Spanish) and the IWSLT task (Chinese-to-English). Experiments revealed good
results on the TC-STAR task, outperforming the baseline, which used the pause model to
improve the segmentation process. On the other hand, experiments showed modest results on
the IWSLT task because the system failed to recognize identical words at the start and the
end of sentences.


The Winnow algorithm (Blum, 1997) is a machine learning algorithm that learns a classifier
from a set of features. (Charoenpornsawat and Sornlertlamvanich, 2004) used this algorithm
to extract sentence breaks from a paragraph: each space is classified as a sentence break
or not by taking the context around it into account, since the space is the only
punctuation mark used in Thai to mark sentence breaks. The training data consist of a set
of segmented paragraphs and segmented sentences in which each word is tagged with the
appropriate POS tag. These training data are passed to the Winnow algorithm to build the
model, and the test set is then fed to the resulting model to evaluate the system. A
trigram model was used for word segmentation and POS tagging, and the Winnow algorithm was
used for sentence break determination. Table 2.2.2.1 summarizes the related works and
their results, while Table 2.2.2.2 presents the advantages and disadvantages of each work.

Table 2.2.2.1: Summary of the related works and their results.


Method | Paper | Purpose | Language | Corpus | Data Size | Accuracy
------ | ----- | ------- | -------- | ------ | --------- | --------
Trigram model using the Viterbi algorithm, with two proposed algorithms (A and B) | Cyberpunc: A lightweight punctuation annotation system for speech | A speech recognition system for automatic punctuation annotation of text based on lexical information. | English | Penn Treebank corpus | 1,265,577 trigrams, with 185,420 commas. | Algorithm A: P: 75.6%, R: 65.6%, F1: 70.2%, Sentence Accuracy: 53.3%. Algorithm B: P: 78.4%, R: 62.4%, F1: 69.4%, Sentence Accuracy: 54.0%
Factorial Conditional Random Field Model (F-CRF) | Better Punctuation Prediction with Dynamic Conditional Random Fields | A multifunction model for sentence boundary, sentence type prediction and punctuation annotation for speech utterances. | English and Chinese | IWSLT09 corpus (BTEC dataset and CT dataset) | A total of 31,000 Chinese-English utterance pairs. | Chinese: P: 93.82%, R: 89.01%, F1: 91.35%; P: 93.72%, R: 92.68%, F1: 93.19%
Conditional Random Field (CRF) | A CRF Sequence Labeling Approach to Chinese Punctuation Prediction | Punctuation annotation of Chinese texts using three different levels of features. | Chinese | Tsinghua Chinese Treebank corpus | Training dataset consists of 57,865 punctuation marks; testing dataset consists of 13,515 punctuation marks. | P: 82.00%, R: 64.90%
Linear-chain Conditional Random Fields | Punctuation Prediction for Vietnamese Texts Using Conditional Random Fields | Configuring a system for punctuation prediction for the Vietnamese language. | Vietnamese | Built corpus of online news journals and papers | 240,000 words, with 66% for training and 34% for testing. | Best set of features: P: 81.24%, R: 39.21%, F1: 52.89%
A combination of Hidden-event language model and BoosTexter classifier | Impact Of Automatic Comma Prediction On POS/Name Tagging Of Speech | Investigating the influence of comma annotation on Part-Of-Speech and name tagging in a speech recognition system. | Mandarin | - | 60,000 words for training and ... | -
The log-linear model | Automatic Sentence Segmentation and Punctuation Prediction for Spoken Language Translation | Investigating the importance of sentence boundary detection and punctuation annotation for the quality of translation using an Automatic Speech Recognition system. | English, Spanish and Chinese | TC-STAR task (English-to-Spanish) corpus and IWSLT (Chinese-to-English) corpus | 28,000 words for training; 5550 words for testing. | P: 69.9%, R: 70.3%
Winnow algorithm | Automatic sentence break disambiguation for Thai | Extracting sentence-breaking spaces, given that spaces are the only punctuation marks used in the Thai language. | Thai | ORCHID corpus | 10,864 sentences (90% for training and 1046 for testing) | -


Table 2.2.2.2: Advantages and disadvantages of each of the algorithms used.


Method | Advantages | Disadvantages
------ | ---------- | -------------
Trigram model using the Viterbi algorithm / Cyberpunc: A lightweight punctuation annotation system for speech | 1) Punctuation annotation based only on lexical information. | 1) Poor performance in the case of long-range dependencies between the words required for punctuation annotation. 2) No reliance on lexical linguistic features such as Part-Of-Speech tags, which can be of benefit in word category disambiguation.
Factorial Conditional Random Fields model / Better Punctuation Prediction with Dynamic Conditional Random Fields | 1) Outperforms traditional linear-chain Conditional Random Fields in investigating long-range dependencies between the observations in the texts, because of its ability to handle multiple layers of sequence tags. 2) Discriminative model for predicting punctuation in English language texts. | 1) Has modest scores in predicting punctuation for Chinese language texts.
Conditional Random Field model / A CRF Sequence Labeling Approach to Chinese Punctuation Prediction | 1) Punctuation annotation based only on lexical information. 2) The ability to benefit from a phrase-level feature in order to facilitate the investigation of sentence type. | 1) The linguistic structure of the Chinese language could be one of the main challenges for this type of method.
Linear-chain Conditional Random Fields / Punctuation Prediction for Vietnamese Texts Using Conditional Random Fields | 1) Ability to capture long-range dependencies, which could be the most influential feature for the punctuation annotation task. | 1) Relies on a limited number of resources to build the training data, which could cause a shortage of training data. 2) Omission of some linguistic features such as POS, which could help to improve the scores of the system.


Chapter 3


Methodology 


This chapter describes our proposed model for predicting punctuation marks in Arabic text,
together with the methodologies, tools and resources used. Our methodology is based on
comparing three commonly used ML algorithms, namely CRF, HMM, and N-gram. We based our
experiments on the BAQ Corpus (Sawalha, et al., 2012). To implement the ML algorithms used
in this research we used the Natural Language Toolkit (NLTK) (Bird, et al., 2009).

An overview of our proposed model is given in Section 3.1. NLTK is described in Section
3.2. The text source used in this research is the BAQ Corpus with added tiers of
punctuation marks and sentence terminals; this new version of the BAQ Corpus is described
in Section 3.3. Section 3.4 discusses the ML algorithms in detail, with examples that
focus on their use for Arabic text.


3.1 Methodology of Predicting Punctuation Marks for Arabic Text 


Our proposed model is based on using ML algorithms to predict punctuation marks for
Classical and Modern Standard Arabic text. Applying ML algorithms to this task requires
training/testing data annotated with punctuation marks.

We have chosen the BAQ Corpus because it is suitable for training/testing ML algorithms
and it has many linguistic features appropriate for our task, such as POS information.
Section 1.5 discusses the motivation for choosing the Qur'an as the training/testing
corpus. Section 3.3 describes the BAQ Corpus in more detail. However, the BAQ Corpus does
not include modern punctuation marks. The only resource that adds modern punctuation marks
to the Qur'an text is fī ẓilāl al-qur'ān (Qutb, 1991). We added a new tier of modern
punctuation marks to the BAQ Corpus, following Qutb's placement of punctuation marks in
his book.


The new version of the BAQ Corpus with added modern punctuation marks was used to train
and test three commonly used ML algorithms: HMM, N-gram and CRF.

Figure 3.1.1 shows an overview of our model. It is divided into two parts. The first part
is data preparation, where modern punctuation marks were added to the BAQ Corpus. The
second part presents the application of ML algorithms to predict punctuation marks. The
following are the major steps of our methodology.


Data Preparation: 


1. Annotating the Qur'an by manually adding punctuation marks to the BAQ Corpus in
accordance with Sayyid Qutb's explanation (Qutb 1991). These punctuation marks are:
Comma "،", Semicolon "؛", Full Stop ".", Question Mark "؟", Exclamation Mark "!",
Colon ":", Ellipsis Mark "...", and Hyphen "-".

2. Four punctuation marks (full stop, question mark, exclamation mark and semicolon) are
used to mark the end of sentences (Zaki 1930). These marks were used to identify sentence
terminals in the Qur'an, which divides the corpus into 8366 sentences. The sentence
terminals are used for the punctuation annotation and sentence terminal prediction tasks.


Applying the ML Algorithms:


1. Splitting the Qur'an corpus for the cross-validation experiments into training and
testing parts, such that the parts are rotated from one experiment to the next. In every
experiment the training part occupies 90% of the corpus and the testing part occupies 10%.

2. Training the machine learning algorithms (HMM, N-gram and CRF taggers) on the training
datasets. Training uses two categories of Part-Of-Speech features for each word in the
corpus: the three-POS tag set (noun, verb and particle) and the ten-POS tag set (noun,
verb, nominal, conjunction, etc.).

3. Applying the trained models to the test datasets. The results obtained by these
algorithms are then compared with the already punctuated text (gold datasets).

4. Predicting punctuation marks for MSA text using the most accurate model obtained from
the previous step.

5. Finally, evaluating the obtained results.
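
The rotating 90%/10% split in step 1 of the list above can be sketched as follows. This is a
minimal illustration and not the code used in the experiments: the function name, the choice
of ten contiguous blocks and the absence of shuffling are assumptions of this example.

```python
def cross_validation_folds(sentences, k=10):
    """Yield (train, test) pairs. Each of the k contiguous blocks serves once as
    the test part (about 1/k of the data); the remaining blocks form the
    training part."""
    n = len(sentences)
    # Integer block boundaries, so every sentence lands in exactly one block.
    bounds = [(i * n) // k for i in range(k + 1)]
    for i in range(k):
        lo, hi = bounds[i], bounds[i + 1]
        yield sentences[:lo] + sentences[hi:], sentences[lo:hi]


if __name__ == "__main__":
    corpus = [f"sentence-{i}" for i in range(8366)]  # placeholder sentences
    for fold, (train, test) in enumerate(cross_validation_folds(corpus), start=1):
        print(fold, len(train), len(test))
```

With k = 10 every sentence is used for testing exactly once and for training nine times,
which matches the 90%/10% proportion described in step 1.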


[Figure 3.1.1 is a flowchart in two parts. Data preparation: punctuation marks are
manually extracted from the fī ẓilāl al-qur'ān explanation and added to the BAQ corpus,
giving the BAQ corpus with punctuation marks. Our methodology: splitting the corpus for
cross validation and applying the models using ML algorithms (HMM, N-gram and CRF);
performance measuring using evaluation metrics; applying the best ML model on MSA text;
punctuated MSA text.]

Figure 3.1.1: The proposed model for predicting punctuation marks for Arabic text.


3.2 Natural Language Toolkit (NLTK)


To implement the ML algorithms in this research we used the Natural Language Toolkit
(NLTK). NLTK is a platform for building Python programs that process human language data.
NLTK was first created in 2001 at the University of Pennsylvania and has since been
developed and improved by many contributors. NLTK is characterized by its simplicity,
consistency, extensibility and modularity, and it has been used for many NLP tasks such as
POS tagging, text classification, text chunking and parsing (Bird, et al., 2009).


3.3 Dataset (Corpus) 


In order to accomplish any NLP task such as information extraction, machine translation,
speech recognition or punctuation mark prediction, a corpus (dataset) is needed. This
corpus must contain text data for the language under investigation, and it is often
constructed so that it carries lexical information on each word, e.g. POS tags such as
noun, verb, particle, etc., or morphological annotation such as root, stem, prefix,
suffix, etc.


Because the Holy Qur'an is the central religious text of Islam, it has received the
attention of researchers throughout the ages. The Qur'an is positioned as a gold standard
corpus in present NLP research, with tens of postgraduate theses and journal articles
using it to train, test, and develop computational linguistic resources. We used an
existing corpus, the Boundary Annotated Qur'an (BAQ) Corpus (Brierley, et al., 2012).
"BAQ Corpus is constructed of 77,430 words and 8,230 sentences, where each word is tagged
with syntactic and prosodic information" (Sawalha, et al., 2012). A description of this
corpus follows shortly.

Since the Qur'an was punctuated hundreds of years ago in a style that is unique to it, and
the BAQ Corpus reflects this punctuation system in its annotation, it became necessary to
annotate the corpus in the modern European-inspired style of punctuation commonly used in
MSA as well. The present research, therefore, modified the BAQ Corpus by adding the
punctuation marks used in Sayyid Qutb's exegesis fī ẓilāl al-qur'ān. This particular book
was used because it is the only authoritative modern interpretation that reproduces the
Qur'an text in an MSA orthographic form, with modern spelling and modern punctuation.
Sayyid Qutb, as a modern exegete, used his understanding of the Qur'an and his knowledge
of Arabic grammar to punctuate the Qur'an text in a European-inspired style, using the
same punctuation marks (commas, full stops, exclamation marks, etc.) that MSA currently
uses. Accordingly, we extracted the punctuation marks from Sayyid Qutb's explanation and
added them to the BAQ Corpus in one column, i.e. one punctuation mark for each word, and
"nopunc" for the words that do not have any punctuation.


One aim of the current research project is to investigate the most suitable
machine-learning algorithm for automatic punctuation of Arabic texts, but the question is
how detailed the knowledge available for training should be. POS tags are critical in any
learning because knowledge of the parts of speech, word order, and syntactic structures
guides sentential semantics, which in turn guides punctuation. To answer the question, the
BAQ Corpus contains two columns: one with coarse POS annotation, and the other with fine
POS annotation. The coarse POS (3-POS) tag set is: noun, verb and particle, while the fine
POS (10-POS) tag set is: noun, pronoun, nominal, adverb, verb, preposition, lām prefixes,
conjunction, particle, and disconnected letters.

Part of the BAQ Corpus version 2.0 is shown in Figure 3.3.1. The figure is structured as a
14-column table. Columns 1-4 contain tracking information: the 1st column holds the Qur'an
chapter reference; the 2nd the surah (passage) reference; the 3rd the ayah (verse)
reference; and the 4th the index of the word within the verse (1 for the 1st word in the
verse, 2 for the 2nd, 3 for the 3rd, etc.). Since each word is stored in one cell in the
4th column, its annotation is stored in the corresponding cells of the remaining columns.
Thus, the 5th column contains the orthographic representation of a word entry in Othmani
script, and the 6th the word's MSA orthographic representation. The 7th and the 8th
columns hold, respectively, the category in the three-POS tag classification and in the
ten-POS tag classification for each word in the 5th column. The 9th column holds the
punctuation mark that follows the word in Sayyid Qutb's exegesis fī ẓilāl al-qur'ān. The
annotations used there are: full stop '.'; comma ','; semicolon ';'; exclamation '!';
question '?'; colon ':'; ellipsis '...'; and hyphen '-'. When no punctuation mark follows
the word, the annotation entered is 'nopunc'. The 10th column represents the terminal
annotation for each sentence in the corpus, where we adopted four punctuation marks to
indicate the ending of sentences in the BAQ Corpus: full stop '.'; semicolon ';';
exclamation '!'; and question '?'. This step produced 8366 sentences. The 11th column
denotes the ending of the ayah according to the Othmani script of the Qur'an. The 12th,
13th and 14th columns specify the phrase boundary annotation (break or non-break
annotation) used in the research of (Sawalha, Brierley et al. 2012).

[Figure 3.3.1 shows sample rows of the corpus table, with columns including the word
index, the word, the coarse POS tag (N, V, P), the fine POS tag (e.g. NOUN, PRONOUN,
NOMINAL, ADVERB, VERB, PARTICLE), the punctuation annotation (e.g. nopunc) and the
boundary annotation (break / non-break).]

Figure 3.3.1: A sample of the BAQ corpus with added modern punctuation marks.
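
The column layout described above, and the way the four terminal marks divide the corpus into
sentences, can be sketched in Python as follows. This is a hedged illustration: the `Row`
type, its field names and the helper functions are hypothetical, only four of the fourteen
columns are modeled, and the sample rows use placeholder words rather than corpus data.

```python
from typing import NamedTuple

# Marks that end a sentence, as listed in the description of the 10th column.
TERMINALS = {".", ";", "!", "?"}


class Row(NamedTuple):
    word: str    # column 6: MSA orthographic form of the word
    pos3: str    # column 7: coarse (3-POS) tag
    pos10: str   # column 8: fine (10-POS) tag
    punc: str    # column 9: mark that follows the word, or "nopunc"


def split_sentences(rows):
    """Group rows into sentences, closing a sentence after every terminal mark."""
    sentences, current = [], []
    for row in rows:
        current.append(row)
        if row.punc in TERMINALS:
            sentences.append(current)
            current = []
    if current:  # trailing words that are not followed by a terminal mark
        sentences.append(current)
    return sentences


def tagging_pairs(sentence, fine=False):
    """Turn one sentence into ((word, POS), punctuation) pairs for a tagger."""
    return [((r.word, r.pos10 if fine else r.pos3), r.punc) for r in sentence]


if __name__ == "__main__":
    rows = [
        Row("w1", "N", "NOUN", "nopunc"),
        Row("w2", "V", "VERB", ","),
        Row("w3", "N", "NOUN", "."),
        Row("w4", "P", "PARTICLE", "nopunc"),
        Row("w5", "N", "PRONOUN", "?"),
    ]
    for sentence in split_sentences(rows):
        print(tagging_pairs(sentence))
```

Switching `fine` between `False` and `True` selects the 3-POS or the 10-POS column, which
mirrors the coarse and fine annotation conditions compared in this research.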


3.4 Machine Learning Algorithms 


Three ML algorithms were used in this research: N-gram, HMM and CRF. These algorithms use
different approaches to prediction. They are probabilistic models from Computational
Linguistics. A probability measures how likely a certain event is to occur, and its value
lies between 0 and 1. Computational Linguistics is a field concerned with the statistical
modeling of natural language from a computational perspective (Daniel and James, 2000). In
this section we describe each algorithm in detail.


3.4.1 N-GRAM Model 


The N-gram model is one of the probabilistic language models concerned with predicting the
next item in a sequence of states (e.g. word prediction, as an application of Natural
Language Processing). Prediction with the N-gram model is performed by computing the
probabilities of a set of candidate state sequences and selecting the sequence with the
highest probability. The N-gram model has several uses in Natural Language Processing,
such as Part-Of-Speech tagging, word similarity and predictive text input systems. In
other fields such as Speech Recognition, it has been applied in Automatic Speech
Recognition systems (Jurafsky and Martin 2007). In our study, we employ the N-gram model
to predict punctuation marks for Arabic text by building a model that is trained and
tested on the Boundary Annotated Qur'an (BAQ) Corpus.


Probabilistic models used in Natural Language Processing applications are based simply on
counting words, POS tags or other features, such as punctuation marks in this research
(Jurafsky and Martin, 2007). We developed an N-gram model that performs this counting task
to predict punctuation marks for Arabic text. Our N-gram model uses the BAQ Corpus for
training and testing. The BAQ Corpus was divided into two portions, 90% for training and
10% for testing.

For example, consider predicting the punctuation mark after the word "الْوَارِثِينَ", which
is a noun, in the sentence [21:89] "رَبِّ لَا تَذَرْنِي فَرْدًا وَأَنتَ خَيْرُ الْوَارِثِينَ". This word is
followed by a full stop ".". Writing $w$ for this word, the probability
$P(. \mid (w, \text{noun}))$ is the number of times the word occurs as a noun followed by
a full stop in the training set, divided by the number of times the word occurs as a noun
in the training corpus (Unigrams). Equation 1 represents this model:

$$P(. \mid (w, \text{noun})) = \frac{C((w, \text{noun}), .)}{C((w, \text{noun}))} \tag{1}$$


The principle of the N-gram model is that we compute the probability of a given word or
symbol based on the last few words. A number of models can be derived from the general
N-gram model depending on the number of preceding words included in the probability
approximation. Accordingly, the Bigram model computes the conditional probability of the
current word given the previous word, $P(w_n \mid w_{n-1})$.

Furthermore, the Trigram model approximates the probability of the current word based on
the previous two words, $P(w_n \mid w_{n-2} w_{n-1})$. N-gram models use Maximum
Likelihood Estimation (MLE) to compute the conditional probabilities. In a Bigram model,
the probability of the word $w_n$ is computed by counting the number of occurrences of the
current word $w_n$ preceded by the word $w_{n-1}$ (i.e. the count of the two words
$w_{n-1}$ and $w_n$ occurring together in the training dataset), and then dividing by the
number of times the word $w_{n-1}$ has occurred in the training dataset. Equation 2
presents the MLE formula for the Bigram model.

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})} \tag{2}$$

In general, we can represent the MLE estimation for the N-gram models by equation 3:

$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})} \tag{3}$$

where $C(w_{n-N+1}^{n-1} w_n)$ represents the number of times the sequence of the previous
words $w_{n-N+1}^{n-1}$ (how many depends on which N-gram model is used) followed by the
current word $w_n$ occurs in the training corpus, and $C(w_{n-N+1}^{n-1})$ represents the
number of times the sequence of the previous N-1 words has occurred in the training corpus
(Jurafsky and Martin 2007).
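
The MLE estimate for word N-grams can be sketched in plain Python as follows. This is a
minimal illustration of the counting behind the Bigram and general N-gram formulas, using
only the standard library; the function names are hypothetical, and the sketch is restricted
to N >= 2 so that the history is never empty.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count all n-grams and all (n-1)-word histories in tokenized sentences."""
    if n < 2:
        raise ValueError("this sketch covers bigrams and longer n-grams only")
    grams, histories = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            grams[tuple(tokens[i:i + n])] += 1          # C(history + word)
        for i in range(len(tokens) - n + 2):
            histories[tuple(tokens[i:i + n - 1])] += 1  # C(history)
    return grams, histories


def mle(word, history, grams, histories):
    """C(history followed by word) / C(history); 0.0 for an unseen history."""
    history = tuple(history)
    seen = histories[history]
    return grams[history + (word,)] / seen if seen else 0.0


if __name__ == "__main__":
    toy = [["a", "b", "a", "b"], ["a", "c"]]  # placeholder tokens
    grams, histories = ngram_counts(toy, n=2)
    print(mle("b", ["a"], grams, histories))  # "a" occurs 3 times, "a b" twice
```

Returning 0.0 for an unseen history is a simplification made for this example; unseen events
are exactly the case that sparse data makes problematic for N-gram models.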


This study aims to apply the N-gram models (Unigram, Bigram and Trigram models) to the BAQ
Corpus in order to produce a model for predicting punctuation marks for Arabic text. So,
how can we implement the N-gram models in this work?

As we will explain in the next chapter, the BAQ Corpus is divided into two portions
(training and testing portions). We use the training part to produce our N-gram model. The
N-gram model uses the word $w_n$, its Part-Of-Speech tag $p_n$ and the punctuation tag
$t_n$ as a feature set $((w_n, p_n), t_n)$ for the task of predicting punctuation marks
for Arabic text. The estimated probability of the tag $t_n$ given the word $w_n$ and its
POS tag $p_n$ using the Unigram model is represented as:

$$P(t_n \mid (w_n, p_n)) = P((w_n, p_n) \mid t_n) \cdot P(t_n) \tag{4}$$

Equation 4 estimates the probability of the punctuation annotation $t_n$ for the word
$w_n$ which has the Part-Of-Speech tag $p_n$. The term $P((w_n, p_n) \mid t_n)$ in
equation 4 is estimated using Maximum Likelihood Estimation (MLE), as presented in
equation 5:

$$P((w_n, p_n) \mid t_n) = \frac{C((w_n, p_n), t_n)}{C((w_n, p_n))} \tag{5}$$

where $C((w_n, p_n), t_n)$ represents the number of times the word $w_n$ with the POS tag
$p_n$ was followed by the punctuation mark $t_n$ in the training corpus, and
$C((w_n, p_n))$ represents the number of times the word $w_n$ with the POS tag $p_n$
occurred in the training corpus. Notice that in equation 4 the probability of the
punctuation tag itself, $P(t_n)$, is computed as $\frac{C(t_n)}{C(t_n)}$, which is equal
to 1.


Now, we use the Bigram model to compute the estimated probability as shown in
equation 6 and equation 7, where $P(t_n \mid t_{n-1})$ is the probability of the current tag
given the previous tag:

$$P(t_n \mid (w_n, p_n)) = P((w_n, p_n) \mid t_n) \cdot P(t_n \mid t_{n-1}) \qquad (6)$$

$$P(t_n \mid (w_n, p_n)) = \frac{C((w_n, p_n), t_n)}{C((w_n, p_n))} \cdot \frac{C(t_{n-1} t_n)}{C(t_{n-1})} \qquad (7)$$

The Trigram model is represented in equation 8, equation 9 and Figure 3.4.1.1:

$$P(t_n \mid (w_n, p_n)) = P((w_n, p_n) \mid t_n) \cdot P(t_n \mid t_{n-2} t_{n-1}) \qquad (8)$$

$$P(t_n \mid (w_n, p_n)) = \frac{C((w_n, p_n), t_n)}{C((w_n, p_n))} \cdot \frac{C(t_{n-2} t_{n-1} t_n)}{C(t_{n-2} t_{n-1})} \qquad (9)$$


For example, to compute the estimated probability of the Bigram model for the word
"الْوَارِثِينَ", which is a noun (POS tag) and is followed by a full stop "." at the end of its
verse, using equations 6 and 7, we use the following equations:

$$P(. \mid (\text{الْوَارِثِينَ}, noun)) = P((\text{الْوَارِثِينَ}, noun) \mid .) \cdot P(. \mid nopunc)$$

$$P(. \mid (\text{الْوَارِثِينَ}, noun)) = \frac{C((\text{الْوَارِثِينَ}, noun), .)}{C((\text{الْوَارِثِينَ}, noun))} \cdot \frac{C(nopunc\ .)}{C(nopunc)}$$

For the Trigram model, the estimated probability is computed using equations 8 and 9:

$$P(. \mid (\text{الْوَارِثِينَ}, noun)) = P((\text{الْوَارِثِينَ}, noun) \mid .) \cdot P(. \mid nopunc\ nopunc)$$

$$P(. \mid (\text{الْوَارِثِينَ}, noun)) = \frac{C((\text{الْوَارِثِينَ}, noun), .)}{C((\text{الْوَارِثِينَ}, noun))} \cdot \frac{C(nopunc\ nopunc\ .)}{C(nopunc\ nopunc)}$$
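
The Bigram estimate of equation 7 can be sketched directly from counts, as shown below. This is an illustration under stated assumptions rather than the code used in this study: the helper names `train_counts` and `bigram_score`, the sentence-start marker `<s>` and the toy sentences (with placeholder words `w1`, `w2`, `w3`) are all hypothetical.

```python
from collections import Counter


def train_counts(tagged_sents):
    """Collect the four counts of equation 7 from ((word, pos), tag) pairs."""
    obs_tag = Counter()     # C((w_n, p_n), t_n)
    obs = Counter()         # C((w_n, p_n))
    tag_bigram = Counter()  # C(t_{n-1} t_n)
    tag_hist = Counter()    # C(t_{n-1})
    for sent in tagged_sents:
        prev = "<s>"  # assumed sentence-start marker
        for (word, pos), tag in sent:
            obs_tag[((word, pos), tag)] += 1
            obs[(word, pos)] += 1
            tag_bigram[(prev, tag)] += 1
            tag_hist[prev] += 1
            prev = tag
    return obs_tag, obs, tag_bigram, tag_hist


def bigram_score(counts, word, pos, tag, prev_tag):
    """Equation 7: emission-style ratio times tag transition ratio."""
    obs_tag, obs, tag_bigram, tag_hist = counts
    if obs[(word, pos)] == 0 or tag_hist[prev_tag] == 0:
        return 0.0  # unseen events receive a zero estimate
    emission = obs_tag[((word, pos), tag)] / obs[(word, pos)]
    transition = tag_bigram[(prev_tag, tag)] / tag_hist[prev_tag]
    return emission * transition


# Hypothetical toy sentences, not taken from the BAQ Corpus.
toy_sents = [
    [(("w1", "noun"), "nopunc"), (("w2", "noun"), ".")],
    [(("w1", "verb"), "nopunc"), (("w2", "noun"), "nopunc"), (("w3", "noun"), ".")],
]
counts = train_counts(toy_sents)
score = bigram_score(counts, "w2", "noun", ".", "nopunc")
```

In this toy data the pair (w2, noun) occurs twice and is followed by a full stop once, while the tag "." follows "nopunc" in two of the three cases where "nopunc" is the previous tag.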


One of the challenges of using the N-gram model is sparse data. This problem
arises when some events are not well represented in the training corpus, so that rare
cases receive an estimated probability of zero (Jelinek, 1980). Using a larger training
dataset could solve this problem, but since we are restricted to a limited corpus (BAQ),
we have to consider another solution.


Back-off models, also called Katz Back-off models (Katz, 1987), are one of the
models proposed to solve the sparse data problem. This model is based on consulting
the lower-order models in order. This means that if we compute the estimated
probability of a certain word using the Trigram model and find that it is zero, then we
back off to consult the lower model (i.e. the Bigram model), and so on until we find an
estimated probability greater than zero. For example, if we have a trigram
of the parameters (x, y and z), then the estimated probability using the Back-off model
is represented in equation 10:

$$P_{katz}(z \mid x, y) = \begin{cases} P^{*}(z \mid x, y), & \text{if } C(x, y, z) > 0 \\ \alpha(x, y)\, P_{katz}(z \mid y), & \text{else if } C(x, y) > 0 \\ P^{*}(z), & \text{otherwise} \end{cases} \qquad (10)$$

Here $P^{*}$ denotes the discounted probability estimate and $\alpha(x, y)$ is the back-off
weight that passes the discounted probability mass to the lower-order model.

[Figure: the Trigram model as a chain of three tags with the terms $p((w_{n-2}, p_{n-2}) \mid t_{n-2})$, $p((w_{n-1}, p_{n-1}) \mid t_{n-1})$ and $p((w_n, p_n) \mid t_n)$.]

Figure 3.4.1.1: The Trigram model.
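
The idea of consulting lower-order models can be sketched as follows. This is a simplified illustration, not full Katz back-off: it omits discounting and the back-off weights, and simply picks the most frequent tag from the highest-order model that has seen the context. The function names, the `<s>` padding and the toy data are assumptions made for this sketch.

```python
from collections import Counter, defaultdict


def train_backoff_tables(tagged_sents):
    """Count tags per context for trigram (3), bigram (2) and unigram (1) models."""
    tables = {3: defaultdict(Counter), 2: defaultdict(Counter), 1: defaultdict(Counter)}
    for sent in tagged_sents:
        history = ["<s>", "<s>"]  # assumed padding for the sentence start
        for obs, tag in sent:
            tables[3][(history[-2], history[-1], obs)][tag] += 1
            tables[2][(history[-1], obs)][tag] += 1
            tables[1][(obs,)][tag] += 1
            history.append(tag)
    return tables


def tag_with_backoff(tables, observations, default_tag="nopunc"):
    """Tag each observation with the highest-order model that has seen its context."""
    history = ["<s>", "<s>"]
    output = []
    for obs in observations:
        contexts = [
            (3, (history[-2], history[-1], obs)),
            (2, (history[-1], obs)),
            (1, (obs,)),
        ]
        tag = default_tag  # used when even the unigram model has no evidence
        for order, context in contexts:
            if context in tables[order]:
                tag = tables[order][context].most_common(1)[0][0]
                break
        output.append(tag)
        history.append(tag)
    return output


# Hypothetical toy training data: ((word, POS), punctuation tag) pairs.
toy_train = [[(("w1", "noun"), "nopunc"), (("w2", "noun"), ".")]]
tables = train_backoff_tables(toy_train)
```

Tagging the single observation `("w2", "noun")` at the start of a sentence finds no trigram or bigram context, so the unigram table decides; an observation never seen in training falls through to the default tag.
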
3.4.2 Hidden Markov Model 


The Hidden Markov Model (HMM) is a probabilistic sequence classifier and one of the
Markov models. It is used to compute a probability distribution over a sequence of
words and to assign the sequence of labels with the highest estimated probability
(Blunsom, 2004).

The HMM has many applications in Natural Language Processing, such as Part-Of-Speech
tagging, speech recognition systems, sentence segmentation, information extraction
and, of course, punctuation marks prediction (Jurafsky and Martin, 2007).

One of the prominent features of the HMM is its ability to refer to observed and
hidden events jointly. In our task, the word $w_n$ in a sequence of words
$(w_1 w_2 \dots w_n)$ and its Part-Of-Speech tag $p_n$ are the observed events, and the
punctuation annotation tag $t_n$ is the hidden event. The Hidden Markov Model consists
of the set of components listed in Table 3.4.2.1.

The Hidden Markov Model can be characterized by three stages:

1. Computing likelihood: Given an HMM $\lambda = (A, B)$ over a set of hidden states
   (punctuation marks in our case) and a sequence of observations $O$, represented
   by the words $w_n$ and their Part-Of-Speech labels $p_n$, compute the likelihood
   $P(O \mid \lambda)$.

2. Decoding: Given the HMM $\lambda = (A, B)$ and the observations $O$ (words $w_n$ with
   their POS labels $p_n$), find the best sequence of hidden states (best path).

3. Learning: Given the sequence of observations $O$ (words $w_n$ with their POS
   labels $p_n$) and the set of hidden states of the HMM, learn the parameters
   $A$ and $B$.

Table 3.4.2.1: Components of Hidden Markov Model.

ID | Symbol | Meaning
1. | $Q = q_1 q_2 q_3 \dots q_N$ | A set of N nodes.
2. | $A = a_{11} a_{12} \dots a_{n1} \dots a_{nn}$ | A transition probability matrix A, where $a_{ij}$ represents the probability of transitioning from node i to node j, with $\sum_{j=1}^{n} a_{ij} = 1, \forall i$.
3. | $O = o_1 o_2 o_3 \dots o_T$ | A sequence of T observations.
4. | $B = b_i(o_t)$ | A sequence of observation likelihoods (emission probabilities), each denoting the probability of observation $o_t$ being produced from state i.
5. | $q_0, q_F$ | Special start and end nodes, where $q_0$ is connected to the possible first hidden nodes with a transition probability $a_{0j}$, and $q_F$ is connected to the last hidden nodes with a transition probability $a_{iF}$, without any association with any observations (O).

The following subsections describe each of these phases and explain their
application to predicting punctuation marks for Arabic text using the BAQ corpus.
3.4.2.1 Computing Likelihood using Forward Algorithm: 


As we said before, the HMM is one of the Markov models, as is the Markov chain. The
key difference between the two models is that the Markov chain does not have any
hidden states, so to compute the probability of any sequence of observations
$(o_1 o_2 \dots o_T)$ we can simply multiply the probabilities of all these observations
together. For the HMM, where a set of possible hidden states $(q_1 q_2 \dots q_N)$ is
proposed for each observation, the procedure is different. The procedure is
illustrated in Figure 4.

Based on the above, we can use the joint probability to compute the likelihood of a
sequence of observations $O$ and a particular sequence of hidden states $Q$ using
equation 11:

$$P(O, Q) = P(O \mid Q) \times P(Q) = \prod_{i=1}^{T} P(o_i \mid q_i) \times \prod_{i=1}^{T} P(q_i \mid q_{i-1}) \qquad (11)$$

However, because we do not know the actual sequence of hidden states, we would have to
compute the likelihood of the sequence of observations over all possible sequences of
hidden states, which has a complexity of $O(N^T)$. Instead of this highly expensive
approach, the Forward algorithm is used.


The Forward algorithm is a dynamic programming algorithm with a complexity of
$O(N^2 T)$. This approach computes the observation probability based on: (i) the
likelihood (emission) probability of the observation $b_j(o_t)$, (ii) the transition
probability $a_{ij}$ of the node that produces this observation, and (iii) the
previous path probability. Equation 12 represents the forward algorithm:

$$\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad (12)$$

Where:
$\alpha_t(j)$: The Forward probability of the observation $o_t$ given the hidden state $j$.
$\alpha_{t-1}(i)$: The Forward probability of the previous observation $o_{t-1}$.
$a_{ij}$: The transition probability from the previous hidden state $q_i$ to the current state $q_j$.
$b_j(o_t)$: The state observation likelihood (emission probability) given the hidden state $j$.
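
The recursion of equation 12 can be sketched in a few lines. The sketch below is illustrative only: the function name, the two-state tag set and all probability values are hypothetical, the observations are reduced to POS tags, and no explicit end-state transition is modelled.

```python
def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Return P(O | lambda) using the forward recursion of equation 12."""
    # Initialisation: alpha_1(j) = a_0j * b_j(o_1)
    alpha = {j: start_p[j] * emit_p[j][observations[0]] for j in states}
    # Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for o_t in observations[1:]:
        alpha = {
            j: sum(alpha[i] * trans_p[i][j] for i in states) * emit_p[j][o_t]
            for j in states
        }
    # Termination: sum the final forward probabilities
    # (this sketch has no explicit end-state transition).
    return sum(alpha.values())


# Hypothetical toy parameters; observations are reduced to POS tags only.
STATES = ["nopunc", "."]
START_P = {"nopunc": 0.8, ".": 0.2}
TRANS_P = {"nopunc": {"nopunc": 0.7, ".": 0.3}, ".": {"nopunc": 0.9, ".": 0.1}}
EMIT_P = {"nopunc": {"noun": 0.3, "verb": 0.7}, ".": {"noun": 0.9, "verb": 0.1}}

likelihood = forward_likelihood(["verb", "noun"], STATES, START_P, TRANS_P, EMIT_P)
```

Each time step costs one pass over all pairs of states, which is where the $O(N^2 T)$ complexity comes from.
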
Figure 3.4.2.1.1 represents a sample of the Forward algorithm for computing the
likelihood for a sequence of hidden states $(q_1 q_2)$ and the corresponding set of
observations $(o_1 o_2 o_3)$, where the length of the sequence of hidden states is equal
to the length of the sequence of observations, taking into consideration that each
hidden state $q_N$ is responsible for generating only one observation $o_T$ at each time
step.

Figure 3.4.2.2 represents the Forward trellis for computing the likelihood of a sequence
of hidden states for the eight punctuation marks and the non-punctuation tag (nopunc),
denoted with circles, and a sequence of observations representing the words with their
three Part-Of-Speech tags for the first verse in the Holy Qur'an (الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ),
denoted with squares.

Figure 3.4.2.1.1: Computing Likelihood using the Forward algorithm.

3.4.2.2 Decoding using the Viterbi Algorithm:


Decoding is the process of detecting the best path of a sequence of hidden states
$(Q = q_1 q_2 \dots q_N)$ that is responsible for generating a sequence of observations
$(O = o_1 o_2 \dots o_T)$, based on the likelihood computed in the previous step
(computing likelihood). The Viterbi algorithm is used for the decoding process.

The Viterbi algorithm computes the maximum probability over the paths leading to the
current hidden state (represented with the continuous black line in Figure 3.4.2.2.1)
and stores the best predecessor in back pointers to keep track of each path
(represented with the dashed line). Figure 3.4.2.2.1 presents the Viterbi algorithm.
Equation 13 shows how the Viterbi path probability is computed:

$$v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t) \qquad (13)$$

Where:
$v_{t-1}(i)$: The previous Viterbi path probability for the previous time step.
$a_{ij}$: The transition probability from the previous state $q_i$ to the current state $q_j$.
$b_j(o_t)$: The state observation likelihood for the current observation $o_t$ and the corresponding hidden state $q_j$.

Figure 3.4.2.2.1: Computing maximum probability for each cell in the Viterbi trellis.
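
Equation 13 together with the back pointers can be sketched as follows. As before, this is a hedged illustration: the function name and the toy two-state parameters are hypothetical, and the observations are reduced to POS tags.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its probability (equation 13)."""
    # v[t][j] holds the best path probability ending in state j at time t.
    v = [{j: start_p[j] * emit_p[j][observations[0]] for j in states}]
    backptr = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backptr.append({})
        for j in states:
            # Best predecessor i maximising v_{t-1}(i) * a_ij
            best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = v[t - 1][best_i] * trans_p[best_i][j] * emit_p[j][observations[t]]
            backptr[t][j] = best_i
    # Follow the back pointers from the best final state.
    last = max(states, key=lambda j: v[-1][j])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


# Hypothetical toy parameters; observations are reduced to POS tags only.
STATES = ["nopunc", "."]
START_P = {"nopunc": 0.8, ".": 0.2}
TRANS_P = {"nopunc": {"nopunc": 0.7, ".": 0.3}, ".": {"nopunc": 0.9, ".": 0.1}}
EMIT_P = {"nopunc": {"noun": 0.3, "verb": 0.7}, ".": {"noun": 0.9, "verb": 0.1}}

best_path, best_prob = viterbi(["verb", "noun"], STATES, START_P, TRANS_P, EMIT_P)
```

The only difference from the Forward recursion is that the sum over predecessor states is replaced by a maximum, with the winning predecessor remembered for backtracking.
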
3.4.2.3 Learning (Training) the HMM using the Forward-Backward Algorithm:


Training is the process of teaching the HMM the parameters A (transition probabilities)
and B (emission probabilities), given the sequence of observations O (unlabeled
observations) and the set of possible hidden states Q, using the Forward-Backward
algorithm (Baum, 1972). The Forward-Backward algorithm can be used to find the most
probable state in the sequence at any time step. The previous section explained the
Forward algorithm; now we present the Backward algorithm, as represented in
equation 14. The Backward algorithm computes the probability of observing the sequence
of observations from $t + 1$ to the end, given that we are in state $i$ at time $t$.

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) \qquad (14)$$

Where:
$\beta_t(i)$: The Backward probability of the state $i$ at time $t$.
$a_{ij}$: The transition probability from state $i$ to state $j$.
$b_j(o_{t+1})$: The emission probability of observation $o$ at time $t + 1$ in state $j$.
$\beta_{t+1}(j)$: The Backward probability of state $j$ at time $t + 1$.

Figure 3.4.2.3.1 shows the computation of the Backward probability of a sequence of
observations at time t given a set of N hidden states.
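
A minimal sketch of the Backward recursion in equation 14 is given below. The function name and the toy parameters are hypothetical and chosen only for illustration; the sketch covers the Backward pass alone, not the full Forward-Backward re-estimation of A and B.

```python
def backward_likelihood(observations, states, start_p, trans_p, emit_p):
    """Return (beta_1, P(O | lambda)) using the backward recursion of equation 14."""
    # Initialisation: beta_T(i) = 1 for every state
    # (no explicit end-state transition in this sketch).
    beta = {i: 1.0 for i in states}
    # Recursion, moving from the last observation back to the second one:
    # beta_t(i) = sum_j a_ij * b_j(o_{t+1}) * beta_{t+1}(j)
    for o_next in reversed(observations[1:]):
        beta = {
            i: sum(trans_p[i][j] * emit_p[j][o_next] * beta[j] for j in states)
            for i in states
        }
    # Termination: combine beta_1 with the start and first emission probabilities.
    likelihood = sum(
        start_p[j] * emit_p[j][observations[0]] * beta[j] for j in states
    )
    return beta, likelihood


# Hypothetical toy parameters; observations are reduced to POS tags only.
STATES = ["nopunc", "."]
START_P = {"nopunc": 0.8, ".": 0.2}
TRANS_P = {"nopunc": {"nopunc": 0.7, ".": 0.3}, ".": {"nopunc": 0.9, ".": 0.1}}
EMIT_P = {"nopunc": {"noun": 0.3, "verb": 0.7}, ".": {"noun": 0.9, "verb": 0.1}}

beta_1, likelihood = backward_likelihood(["verb", "noun"], STATES, START_P, TRANS_P, EMIT_P)
```

The Backward pass arrives at the same likelihood as the Forward pass over the same parameters, which is a useful consistency check when implementing both.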


In general, the HMM has some drawbacks: (i) its inability to deal with multiple
interacting features and (ii) its weakness in detecting long-range dependencies
along a sequence of observations (Lafferty, et al., 2001). In our task, there is a set of
interacting features such as words, POS tags and punctuation tags. In addition, our task
depends on long-range dependencies, since punctuation marks prediction depends on long
sequences of words. Therefore, we expect some deficits in the performance of the
punctuation marks prediction task when we apply the HMM algorithm.

Figure 3.4.2.3.1: Computing the Backward probability of a sequence of observations at time T given state I.

3.4.3 Conditional Random Fields (CRF)


Conditional Random Fields (CRF) is a framework used for building probabilistic
models for sequence labeling and segmentation. The CRF model addresses a problem of
generative models such as the Hidden Markov Model, which try to maximize the joint
probability over the observations and the label sequences. The problem with the joint
probability is that computing the emission probabilities of the observations (words)
given a sequence of labels (hidden states) requires enumerating all possible sequences
of observations. In contrast, conditional models such as the CRF concentrate on
computing the conditional probability of a sequence of labels given the sequence of
observations.

Furthermore, the HMM is restricted to a fixed number of feature functions for
computing the transition and emission probabilities. The CRF model, on the other hand,
can use a varied number of feature functions that inspect later and earlier positions
in a sequence of observations. These feature functions are used to compute the
probability of a sequence of labels, which is an additional advantage over the HMM,
and they play a key role in detecting long-range dependencies between a sequence
of hidden states and the corresponding sequence of observations.


We assume two random variables X and Y, where $X = (x_1, x_2, x_3, \dots, x_T)$ represents the
sequence of observations to be labeled and $Y = (y_1, y_2, y_3, \dots, y_T)$ represents the
sequence of labels (hidden states); X and Y have the same length. We assume that the
components of X are the words and their POS tags from the BAQ corpus, while the
components of Y are the punctuation or sentence terminal labels with which we want
to tag the observations X. The model is then the conditional distribution $p(Y \mid X)$,
without explicitly modeling the probability of the observations $p(X)$.

We define a graph $G = (V, E)$ such that $Y = (Y_v)_{v \in V}$, so that Y is indexed by
the vertices of G. Then (X, Y) is a CRF if, when conditioned on X, the random
variables $Y_v$ obey the Markov property with respect to the graph:
$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$, where $w \sim v$ means
that w and v are neighbors (Lafferty, McCallum et al. 2001). Equation 15 presents the
general form of the CRF model:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{k=1}^{K} \sum_{t=1}^{T} \lambda_k f_k(y_t, y_{t-1}, x_t) \right) \qquad (15)$$

Where $Z(x) = \sum_{y} \exp\left( \sum_{k=1}^{K} \sum_{t=1}^{T} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)$ is the normalization
function and $f_k(y, y', x_t)$ is a set of real-valued feature functions, where each
feature function $f_k$ has a corresponding weight $\lambda_k$ (Sutton and McCallum 2006).
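
Equation 15 can be illustrated on a toy problem by enumerating every label sequence to obtain $Z(x)$. The sketch below is a brute-force illustration under stated assumptions, not the CRF implementation used in this study: the two feature functions, their weights, the `<s>` start label and the two-label set are all hypothetical, and real CRF toolkits compute $Z(x)$ with dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["nopunc", "."]


def f_noun_stop(y_t, y_prev, x_t):
    """Hypothetical feature: a noun is labelled with a full stop."""
    return 1.0 if (x_t == "noun" and y_t == ".") else 0.0


def f_double_stop(y_t, y_prev, x_t):
    """Hypothetical feature: two consecutive full stops."""
    return 1.0 if (y_prev == "." and y_t == ".") else 0.0


FEATURES = [f_noun_stop, f_double_stop]
WEIGHTS = [1.5, -2.0]  # hypothetical lambda_k values


def score(y, x):
    """Inner double sum of equation 15: sum over t and k of lambda_k * f_k."""
    total = 0.0
    y_prev = "<s>"  # assumed start label
    for y_t, x_t in zip(y, x):
        total += sum(w * f(y_t, y_prev, x_t) for w, f in zip(WEIGHTS, FEATURES))
        y_prev = y_t
    return total


def crf_prob(y, x):
    """p(y | x) with Z(x) computed by enumerating all label sequences."""
    z = sum(
        math.exp(score(list(cand), x))
        for cand in itertools.product(LABELS, repeat=len(x))
    )
    return math.exp(score(y, x)) / z


x = ["verb", "noun"]
p_stop_after_noun = crf_prob(["nopunc", "."], x)
```

Because the weights are attached to arbitrary feature functions rather than to fixed transition and emission tables, further features over the observations can be added without changing the form of the model.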


As we mentioned before, a feature function $f_k$ of the CRF model can depend on the
whole sequence of observations x rather than on a single observation $x_t$. Figure 3.4.3.1
presents the undirected graph of the CRF model for a sequence of observations X and
the corresponding sequence of labels Y.

Figure 3.4.3.1: The CRF model.

Chapter 4

Design of Experiments


This chapter discusses the implementation of the ML algorithms in the two tasks:
punctuation marks prediction and sentence terminal prediction. It begins with the
general form of the experiments and the preparation of the BAQ corpus for the
experiments. It then states the two classification problems (i.e. punctuation and
sentence terminal prediction) and the design of the NLTK code for processing these two
tasks using three ML algorithms. An experiment on predicting punctuation marks for MSA
text is then conducted using the ML algorithm that has the best performance accuracy.
A conclusion is drawn at the end of all the experiments.


4.1 Experiments in general 


4.1.1 Cross Validation Experiments 


As we explained in section 2.1.4, the chapters of the Holy Qur'an are divided into
Makki and Madani chapters, whose verses differ in their characteristics. We have
conducted cross validation experiments on the BAQ corpus for each phase of the
experiments to make sure that each chapter of the Qur'an is fairly represented. The
text used for training and the text used for testing are rearranged and interchanged
as in the schematic permutation below, such that the training part and the testing
part always occupy 90% and 10% of the corpus respectively. Figure 4.1.1.1 presents the
structure of the training and testing sets for each experiment of the ML algorithms in
the cross validation experiments.

Run 1:  Testing 10%  | Training 90%
Run 2:  Training 10% | Testing 10% | Training 80%
Run 3:  Training 20% | Testing 10% | Training 70%
Run 4:  Training 30% | Testing 10% | Training 60%
Run 5:  Training 40% | Testing 10% | Training 50%
Run 6:  Training 50% | Testing 10% | Training 40%
Run 7:  Training 60% | Testing 10% | Training 30%
Run 8:  Training 70% | Testing 10% | Training 20%
Run 9:  Training 80% | Testing 10% | Training 10%
Run 10: Training 90% | Testing 10%

Figure 4.1.1.1: The cross validation experiments.


4.1.2 Preparing the Dataset (Training and Testing Datasets) 


In order to run the machine learning algorithms, we first have to prepare the
dataset. This involves two stages: first, breaking the BAQ corpus into sentences;
second, splitting the BAQ corpus into training and testing parts for the cross
validation experiments.


4.1.2.1 Breaking the Dataset into Sentences 


Breaking the dataset into sentences requires reading the corpus and then determining
where sentences end. We have adopted four punctuation marks to indicate the end of
sentences (i.e. question mark "?", exclamation mark "!", full stop "." and semicolon
";"). Based on that, 8366 sentences resulted. Each sentence consists of a sequence of
words, where each word is connected with its POS and punctuation tags (i.e. word, POS
tag, and punctuation tag) in the nine class problem (i.e. punctuation marks
prediction), and with its POS and terminal tags (i.e. word, POS tag, terminal tag) in
the two class problem (i.e. sentence terminal prediction).

For the punctuation marks prediction task (nine class problem), we have conducted two
types of experiments, one for each category of POS tags (i.e. 3 POS tags and 10 POS
tags). The feature set for the 3 POS experiments consists of the word itself, its
3 POS tag and its punctuation tag (i.e. word, 3 POS tag, and punctuation tag). The
feature set for the 10 POS experiments consists of the word itself, its 10 POS tag and
its punctuation tag (i.e. word, 10 POS tag, and punctuation tag).


In contrast, for the sentence terminal prediction task (two class problem), the feature
set of the 3 POS experiment consists of the word itself, its 3 POS tag and its terminal
tag (i.e. word, 3 POS tag, terminal tag), while the feature set for the 10 POS
experiment consists of the word itself, its 10 POS tag and its terminal tag (i.e. word,
10 POS tag, terminal tag). An NLTK Python code was written to break the BAQ Corpus
into sentences. The first step is to read the corpus and define two variables,
sent_list and quran_list. The sent_list variable is used to store the sequence of
words with their features (word, POS tag, and punctuation tag) that constructs a
complete sentence. Each sentence terminal is defined based on the four punctuation
marks: full stop, exclamation, question and semicolon marks. All of the sentences are
stored in the quran_list variable. Figure 4.1.2.1.1 presents the NLTK code for
breaking the BAQ corpus. Figure 4.1.2.1.2 illustrates the process of breaking the BAQ
Corpus.

import codecs

def breaking_quran_corpus():
    # Read the annotated corpus: one token per line, whitespace-separated columns.
    lines = codecs.open(r'BAQ_Corpus_v2_with_punctuations.txt', 'r',
                        'utf-8').readlines()
    outfile = codecs.open(r'HMM_pos3\quran_list.txt', 'w', 'utf-8')
    sent_list = []   # tokens of the sentence currently being built
    quran_list = []  # all completed sentences
    for line in lines:
        if line[0] == u'\ufeff':  # drop the byte order mark
            line = line[1:]
        if len(line) <= 4:        # skip empty or near-empty lines
            pass
        else:
            tokens = line.rstrip().lstrip().split()
            word, punc, pos3, terminal, pos10 = (tokens[5], tokens[6],
                                                 tokens[7], tokens[8], tokens[9])
            sent_list.append((word, pos3, punc))
            if terminal == u'terminal':  # a sentence terminal closes the sentence
                quran_list.append(sent_list)
                for sent in sent_list:
                    outfile.write('%s\t%s\t%s\n' % (sent[0], sent[1], sent[2]))
                sent_list = []
    outfile.close()
    return quran_list

Figure 4.1.2.1.1: NLTK code for breaking the BAQ Corpus into sentences.

[Figure: the corpus is broken into the Qur'an list, in which each sentence is a list of (word, POS tag, punctuation tag) entries such as those tagged nopunc.]

Figure 4.1.2.1.2: Breaking the BAQ Corpus into sentences.


4.1.2.2 Splitting the Corpus (quran list) into Training, Testing and Gold Parts 


The second stage of preparing the dataset is splitting the quran_list file that was
produced in the previous stage into training, testing and gold sets; the quran_list
file is the input to this process. Splitting the corpus was repeated 10 times, such
that the training set always occupies 90% of the dataset and the testing set always
occupies 10% of the dataset, while the two sets are rearranged and interchanged in
each run. Figure 4.1.2.2.1 presents the process used at this stage.

The training part consists of a set of sentences; each sentence consists of a set of
tokens (words), and each word is connected with its corresponding POS and punctuation
tags, which constitute the features that the ML algorithms use for training, e.g. a
triple of the form (word, POS tag, nopunc).

The gold part and the test part have the same size and the same sentences, except that
the elements of each sentence in the test set consist of the word itself and its POS
tag, i.e. (word, POS tag), while those in the gold part consist of the word, its POS
tag, and its punctuation tag, i.e. (word, POS tag, punctuation tag).

After training the ML algorithms on the training set in each run of the 10 fold
experiments, the produced model is used to tag each word in the test set with its
appropriate punctuation tag. The resulting word, POS tag and punctuation tag triples
of the test set are then compared with the corresponding triples in the gold set to
measure the performance accuracy of each machine learning algorithm.

Figure 4.1.2.2.2 presents a piece of code for splitting the quran_list to generate the
training, testing and gold sets for the cross validation runs, while Figure 4.1.2.2.3
presents the code for writing the training, testing, and gold datasets for each run of
the cross validation experiments.

[Figure: the quran_list file is split into training, testing and gold files.]

Figure 4.1.2.2.1: Splitting the quran list for the cross validation experiments to generate
the training and testing and gold datasets.


def splitting_quran_list(quran_corpus):
    start = 0
    len_test_lst = int(len(quran_corpus) * 0.10)  # size of one test fold
    test_list = []
    train_list = []
    gold_list = []
    for i in range(10):
        if i == 9:
            # The last fold takes all remaining sentences.
            end = len(quran_corpus)
        else:
            end = start + len_test_lst
        print('start: %d\t End: %d ' % (start, end))
        test_list = quran_corpus[start:end]
        train_list = quran_corpus[:start] + quran_corpus[end:]
        print('Test: %d\t Train: %d ' % (len(test_list), len(train_list)))
        start = end
        test_sents = generate_test_file(test_list, i)
        train_sents = generate_train_file(train_list, i)
        gold_sents = generate_gold_file(test_list, i)

Figure 4.1.2.2.2: Splitting the quran list file for 10 times into Training and Testing
/Gold sets.

def generate_train_file(train_list, i):
    # Training file: word, POS tag and punctuation tag per line.
    outfile = codecs.open(r'HMM_pos3\train\train_file_%d.txt' % i, 'w', 'utf-8')
    for sent in train_list:
        for token in sent:
            outfile.write('%s %s %s\n' % (token[0], token[1], token[2]))
    outfile.close()

def generate_test_file(test_list, i):
    # Test file: word and POS tag only; the punctuation tag is withheld.
    outfile = codecs.open(r'HMM_pos3\test\test_file_%d.txt' % i, 'w', 'utf-8')
    for sent in test_list:
        for token in sent:
            outfile.write('%s %s\n' % (token[0], token[1]))
    outfile.close()

def generate_gold_file(test_list, i):
    # Gold file: the same sentences as the test file, with the punctuation tag.
    outfile = codecs.open(r'HMM_pos3\gold\gold_file_%d.txt' % i, 'w', 'utf-8')
    for sent in test_list:
        for token in sent:
            outfile.write('%s %s %s\n' % (token[0], token[1], token[2]))
    outfile.close()

Figure 4.1.2.2.3: Codes for generating the train, test and gold dataset for each of the
cross validation experiments.


4.2 Punctuation Marks Prediction (Nine-Class Problem) and Sentence Terminal Prediction (Two-Class Problem)

The punctuation marks prediction task (nine class problem) consists of experimenting
with three machine learning algorithms (i.e. N-gram, HMM, and CRF) on the corpus with
both categories of POS tag sets (i.e. three-POS tags and ten-POS tags). After preparing
the dataset, the three ML algorithms are applied in each of the ten experiments: each
algorithm is trained on the training set to produce a model, which is then used to tag
the test set. The feature set is composed of the word itself, its POS tag and its
punctuation tag. In this section we present how each of the algorithms is applied.

Sentence terminal prediction (two class problem) is one of the main topics in NLP.
Sentence breaking is the process of breaking long sentences into smaller ones; the
presence or absence of breaks between sentences changes the meaning of the text
(Agüero, et al., 2003).


For the sentence terminal prediction problem we have two classes: the terminal class,
denoted with "terminal", and the non-terminal class, denoted with "non". The terminal
class indicates the end of a sentence. We have adopted four punctuation marks that
indicate sentence terminals, i.e. full stop ".", question mark "?", exclamation mark
"!" and semicolon ";". The non-terminal class is placed inside the sentence to
indicate the absence of a terminal or break.


As with punctuation annotation, we use the BAQ corpus, which consists of 77430 words,
for the phrase break prediction task. Because we adopted four punctuation marks to
indicate phrase breaks, we obtained 8366 breaks, or sentences, in the corpus. This
dataset is used to train and test the three machine learning algorithms (taggers),
i.e. N-gram, HMM, and CRF, in order to find the best-performing algorithm.


Cross validation experiments are conducted, and two categories of POS tag sets, i.e.
the three-POS tag set and the ten-POS tag set, are tested for each of the taggers.
The corpus (dataset) is again split into two parts: a training set that always
occupies 90% of the dataset and a testing set that occupies 10% of the dataset. The
training and testing sets are rearranged and interchanged as in the schematic in
Figure 4.1.1.1. We note that the number of terminals in each of the reference datasets
is equal to 836, except for the last, which has 842 terminals.

Two types of experiment are conducted for phrase break prediction, i.e. three-POS
tag set experiments and ten-POS tag set experiments. The features used for training the
three algorithms are the word itself, the three-POS tag and the terminal tag for the
three-POS tag experiment, and the word itself, the ten-POS tag and the terminal tag for
the ten-POS tag experiment. Figure 4.2.2 presents the experiments of the
punctuation marks prediction and sentence terminal prediction. An example of the BAQ
corpus is presented in Figure 4.2.1 below, where the words are colored yellow, the
three-POS tags green, the ten-POS tags orange, the sentence breaks blue and the
corresponding punctuation marks gray.

A full description of the structure of the BAQ Corpus is presented in section 3.3.


Figure 4.2.1: An example of sentence terminals from the BAQ corpus.

Task | Classes | POS feature | Feature set
Punctuation marks prediction (nine class problem) | , / . / ? / ! / ; / : / _ / .. / nopunc | 3 POS tags | {word, 3 POS tag, punctuation tag}
Punctuation marks prediction (nine class problem) | , / . / ? / ! / ; / : / _ / .. / nopunc | 10 POS tags | {word, 10 POS tag, punctuation tag}
Sentence terminal prediction (two class problem) | terminal, non-terminal | 3 POS tags | {word, 3 POS tag, terminal tag}
Sentence terminal prediction (two class problem) | terminal, non-terminal | 10 POS tags | {word, 10 POS tag, terminal tag}

Figure 4.2.2: The experiments of punctuation marks prediction and sentence
terminal prediction.


4.2.1 N-GRAM Model 


In this subsection we describe the application of the N-gram algorithm to predicting
punctuation marks and sentence terminals for Arabic text. The N-gram algorithm itself
was discussed in detail in Section 3.4.1. The N-gram algorithm is trained and tested
using the BAQ Corpus. The feature set used by the N-gram algorithm includes: (i) the
word itself, (ii) 3 POS tags or 10 POS tags, and (iii) the punctuation or sentence
terminal tags.

Applying the N-gram algorithm requires implicitly building three models: a Trigram
model, a Bigram model, and a Unigram model. These models are built by computing
the estimated probability for each token in a sequence of observations. Building three
N-gram models enables the tagger to benefit from the Back-off technique. This
technique enables the higher model to consult the lower model (e.g. the Trigram model
(the higher model) consults the Bigram model (the lower model)) in the cases where the
estimated probability $P((w_n, p_n) \mid t_n)$ for a certain word in a sequence of
observations is zero. After building the N-gram model, it is applied to predict
punctuation marks in the test set. Then, the results are evaluated using the gold
standard dataset. The evaluation process compares the results of the N-gram model
against their equivalents in the gold standard dataset and reports a set of standard
evaluation metrics.
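
The comparison against the gold standard can be sketched as follows. This is an illustration only: the function name is hypothetical, and the balanced rate is assumed here to be the mean of the per-class recalls, which may differ in detail from the definition of the metric adopted in this study.

```python
from collections import Counter


def accuracy_and_balanced_rate(gold_tags, predicted_tags):
    """Return (token accuracy, mean per-class recall) for two aligned tag lists."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted tag lists must have the same length")
    pairs = list(zip(gold_tags, predicted_tags))
    correct = sum(1 for g, p in pairs if g == p)
    per_class_total = Counter(gold_tags)
    per_class_correct = Counter(g for g, p in pairs if g == p)
    # Recall of each class that occurs in the gold standard.
    recalls = [per_class_correct[c] / per_class_total[c] for c in per_class_total]
    return correct / len(gold_tags), sum(recalls) / len(recalls)


# Hypothetical toy tags: the tagger misses the only full stop.
gold = ["nopunc", "nopunc", "nopunc", "."]
pred = ["nopunc", "nopunc", "nopunc", "nopunc"]
acc, balanced = accuracy_and_balanced_rate(gold, pred)
```

The toy case shows why a balanced measure is reported next to accuracy: because "nopunc" dominates the data, a tagger that never predicts a punctuation mark still reaches a high accuracy, while its balanced rate is much lower.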


Figure 4.2.1.1 presents the NLTK code for training the N-gram model, where the 
default tagger value of “nopunc” is used when the estimated probability of any 
observation is equal to zero. The N-gram model is built by training on 9096 of the BAQ 
corpus sentences. After building the N-gram model, it is used to predict punctuation 
marks and sentence terminals on the test dataset which represents 1096 of the BAQ 


Corpus sentences. Figure 4.2.1.2 shows the code for testing the N-gram model. 


import nltk

# Back-off chain: trigram -> bigram -> unigram -> default tag 'nopunc'
default_tagger = nltk.DefaultTagger('nopunc')
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(train_sents, backoff=unigram_tagger)
trigram_tagger = nltk.TrigramTagger(train_sents, backoff=bigram_tagger)


Figure 4.2.1.1: The code for training the N-gram models. 


# Tag every test sentence with the trigram tagger (which backs off as needed)
tagged_sents = []
for i in range(len(test_sents)):
    tagged_sents.append(trigram_tagger.tag(test_sents[i]))


Figure 4.2.1.2: The code for testing the N-gram model. 
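The 90%/10% division of sentences into training and test sets over ten runs could be produced along the following lines. This is only a sketch under the assumption that the folds are contiguous blocks of sentences; the function name ten_fold_splits is hypothetical and the actual splitting procedure used in this research may differ.

```python
def ten_fold_splits(sents, k=10):
    """Yield (train_sents, test_sents) pairs; each fold is used once for testing.

    Assumes contiguous folds; the last fold absorbs any remainder.
    """
    fold_size = len(sents) // k
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(sents)
        test_sents = sents[start:end]
        train_sents = sents[:start] + sents[end:]
        yield train_sents, test_sents
```

Each yielded pair can then be passed to the training and testing code of Figures 4.2.1.1 and 4.2.1.2.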
4.2.2 HMM Model


This section presents the application of the HMM tagger for predicting punctuation
marks and sentence terminals of Arabic text. The HMM model is trained and tested
using the BAQ Corpus. The feature set used by the HMM algorithm includes: (i) the
word itself, (ii) 3 POS tags or 10 POS tags, and (iii) the punctuation or sentence
terminal tags.


The application of the HMM tagger includes two main functions: load_pun and
run_HMM_tagger. The load_pun function is used for splitting the input and storing the
words and their POS tags in a variable called symbols, and the punctuation tags in a
variable called tag_set. The tagged tokens are stored in a cleaned_sentences variable.
The three variables, i.e. symbols, tag_set and cleaned_sentences, are passed to the
run_HMM_tagger function. Running the HMM tagger produces a model for each of the
ten experiments. The produced model is used for tagging the test dataset in each run of
the ten experiments. Figure 4.2.2.1 presents the two functions load_pun and
run_HMM_tagger.
import re

import nltk
from nltk.probability import LidstoneProbDist


def load_pun(train_sents):
    sentences = train_sents
    tag_re = re.compile(r'[*]|--|[^+*-]+')  # tag clean-up pattern (not applied below)
    tag_set = set()
    symbols = set()
    sent = []
    cleaned_sentences = []
    print(len(sentences))
    for sentence in sentences:
        for element in sentence:
            word, tag = element[0], element[1]
            symbols.add(word)         # store each word with its POS tag
            tag_set.add(tag)          # punctuation marks
            sent.append((word, tag))  # store cleaned-up tagged token (word, POS tag, Punc tag)
        cleaned_sentences += [sent]
        sent = []
    return cleaned_sentences, list(tag_set), list(symbols)


def run_HMM_tagger(train_sents, gold_sents):
    labelled_sequences, tag_set, symbols = load_pun(train_sents)
    trainer = nltk.HiddenMarkovModelTrainer(tag_set, symbols)
    # Lidstone smoothing with gamma = 0.1
    hmm = trainer.train_supervised(
        labelled_sequences,
        estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))
    tagged_sents = hmm.test(gold_sents, verbose=True)
    return tagged_sents


Figure 4.2.2.1: HMM tagger code. 
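The estimator passed to the HMM trainer in Figure 4.2.2.1 applies Lidstone smoothing with gamma = 0.1. As a reminder of what this does, the sketch below computes the Lidstone estimate (count + gamma) / (total + bins * gamma) directly; the function name lidstone_probability is hypothetical and is not part of NLTK.

```python
def lidstone_probability(count, total, bins, gamma=0.1):
    """Lidstone-smoothed probability of one outcome.

    count: how often the outcome was observed
    total: total number of observations
    bins:  number of possible outcomes
    """
    return (count + gamma) / (total + bins * gamma)
```

With this estimate an outcome that was never observed in training still receives a small non-zero probability, so an unseen word does not force the probability of a whole tag sequence to zero.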


4.2.3 CRF Model 


This section presents the application of the CRF model for predicting punctuation marks
and sentence terminals for Arabic text. The CRF model was trained and tested using the
BAQ Corpus. In contrast to the HMM tagger, which is restricted to a constant number
of feature functions, the CRF model has the advantage of being able to use a varied
number of preceding and succeeding features. Because of this advantage, the CRF
model is expected to be the best model for the punctuation mark and sentence terminal
prediction tasks for Arabic text.
A set of preceding and succeeding features is defined as the CRF model feature set.
These features are: the current word and its POS tag; the succeeding word at position
(+1) and its POS tag; the preceding word at position (-1) and its POS tag; the
preceding word at position (-2) and its POS tag; and the preceding word at position
(-3) and its POS tag. If the current word is the first word in the sentence, then a “BOS”
feature is added; likewise, if the current word is the last word in the sentence, then an
“EOS” feature is added. The essential feature is the current word and its POS tag.
Table 4.2.3.1 shows these features.


Table 4.2.3.1: The set of features used in the CRF model.


Feature        Definition
%x[i, 0]       The current word.
%x[i, 1]       The POS tag of the current word.
%x[i-1, 0]     The previous word at position (-1).
%x[i-1, 1]     The POS tag of the word at position (-1).
%x[i-2, 0]     The previous word at position (-2).
%x[i-2, 1]     The POS tag of the word at position (-2).
%x[i-3, 0]     The previous word at position (-3).
%x[i-3, 1]     The POS tag of the word at position (-3).
%x[i+1, 0]     The succeeding word at position (+1).
%x[i+1, 1]     The POS tag of the word at position (+1).


Python code was prepared for supplying the selected feature set to the CRF model.
Three functions were written for preparing the feature set to be used in the CRF
model: sent2features, word2features, and sent2punc. The sent2features function is used
to pass each word in the sentence to the word2features function. Figure 4.2.3.1 presents
the sent2features function. The word2features function is used to extract the feature set
for each word in the sentence. Figure 4.2.3.2 presents the word2features function. The
sent2punc function is used for extracting the punctuation tag for each word in the
sentence. Figure 4.2.3.3 presents the sent2punc function.


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


Figure 4.2.3.1: sent2features function for passing each word to the word2features function.


def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'word=' + word,
        'postag=' + postag,
    ]
    # Condition reconstructed from the text above: "BOS" marks the first word.
    if i > 0:
        # Note: for i < 3 the negative indices wrap around to the end of the sentence.
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        word2 = sent[i-2][0]
        postag2 = sent[i-2][1]
        word3 = sent[i-3][0]
        postag3 = sent[i-3][1]
        features.extend([
            '-1:word1=' + word1,
            '-1:postag1=' + postag1,
            '-2:word2=' + word2,
            '-2:postag=' + postag2,
            '-3:word3=' + word3,
            '-3:postag3=' + postag3,
        ])
    else:
        features.append('BOS')
    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word1=' + word1,
            '+1:postag1=' + postag1,
        ])
    else:
        features.append('EOS')
    return features


Figure 4.2.3.2: word2features function for extracting the features of each word in the
sentence.
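The window of Table 4.2.3.1 can also be written as a loop over offsets with an explicit bounds check at every position. The sketch below is a hypothetical variant (window_features is not a function from this research) that assumes each token is a (word, POS tag, punctuation tag) tuple; it simply omits neighbours that fall outside the sentence, so its output differs from Figure 4.2.3.2 near the sentence boundaries.

```python
def window_features(sent, i, offsets=(-3, -2, -1, 1)):
    """Features for token i: the word, its POS tag, and in-range neighbours."""
    word, postag = sent[i][0], sent[i][1]
    features = ['word=' + word, 'postag=' + postag]
    for off in offsets:
        j = i + off
        if 0 <= j < len(sent):  # skip neighbours outside the sentence
            features.append('%+d:word=%s' % (off, sent[j][0]))
            features.append('%+d:postag=%s' % (off, sent[j][1]))
    if i == 0:
        features.append('BOS')  # first word of the sentence
    if i == len(sent) - 1:
        features.append('EOS')  # last word of the sentence
    return features
```

Checking each offset separately means that a word near the start of a sentence never borrows features from words at the end of the same sentence.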
def sent2punc(sent):
    return [punctag for word, postag, punctag in sent]


Figure 4.2.3.3: sent2punc for extracting punctuation tags from each observation. 


These two functions, i.e. sent2features and sent2punc, are used together to run the
CRF trainer. Figure 4.2.3.4 presents the CRF tagger function.


import codecs

import pycrfsuite


def crf_tagger(train_sents, test_sents, fileno):
    x_train = [sent2features(s) for s in train_sents]
    y_train = [sent2punc(s) for s in train_sents]
    x_test = [sent2features(s) for s in test_sents]
    y_test = [sent2punc(s) for s in test_sents]

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(x_train, y_train):
        trainer.append(xseq, yseq)

    trainer.set_params({
        'c1': 1.0,    # coefficient for the L1 penalty
        'c2': 1e-3,   # coefficient for the L2 penalty
        'max_iterations': 50,
        'feature.possible_transitions': True
    })
    trainer.train('BAQ_crf_test.crfsuite')  # Training the model

    tagger = pycrfsuite.Tagger()
    tagger.open('BAQ_crf_test.crfsuite')
    example_sent = test_sents[0]
    print(' '.join(sent2word(example_sent)), end='\n\n')
    print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
    print("Correct:  ", ' '.join(sent2punc(example_sent)))

    y_pred = [tagger.tag(xseq) for xseq in x_test]
    outfile = codecs.open(r'result\crf_results_%d.txt' % fileno, 'w', 'utf-8')


Figure 4.2.3.4: CRF tagger function. 


After a model has been trained for each run of the ten experiments, these models
are used to tag each sentence in the test sets. As an example, Figure 4.2.3.5 presents a
sample sentence, i.e. "وما كان لنفس أن تموت إلا بإذن الله كتابا مؤجلا", which was tagged
with punctuation marks using the CRF model.
وما كان لنفس أن تموت إلا بإذن الله كتابا مؤجلا


Predicted: nopunc nopunc nopunc nopunc nopunc nopunc nopunc nopunc nopunc. 


Gold : nopunc nopunc nopunc nopunc nopunc nopunc nopunc nopunc nopunc. 


Figure 4.2.3.5: A sample of tagged sentence with punctuation marks using the CRF 
model. 


4.3 Design of the Modern Standard Arabic Text Punctuation Marks Prediction Experiment


This section presents the design of the experiment for tagging the MSA text with
punctuation marks using the ML algorithm that has the best performance scores.


A text from Sayyid Qutb's book "معالم في الطريق" (“Mallem Fittareek”) (Qutb 1979)
was selected. The selected text consists of 3859 words, not counting punctuation
marks. As a prerequisite for running the ML algorithm, the text needs to be tagged with
coarse POS tags (i.e. 3 POS [noun, verb, and particle]), such that each word is
annotated with one of the POS tags. The Stanford POS tagger was used first for
tagging the text with POS tags. The POS tags obtained for the MSA text were
inaccurate; in addition, the text was tagged with fine POS tags (10-POS tags), which
does not help the argument that the coarse POS tags (i.e. 3-POS tags) are sufficient for
tagging an MSA text. Therefore, we chose to tag the MSA text with coarse POS tags
manually. Afterwards, the MSA text was tokenized and arranged such that each word,
its POS tag, and its punctuation mark tag are placed in three adjacent columns. Figure
4.3.1 presents a sample of the structure of the selected MSA text after processing.


تقف       V   nopunc
البشرية   N   nopunc
اليوم     N   nopunc
على       P   nopunc
حافة      N   nopunc
الهاوية   N
لا        P   nopunc
بسبب      N   nopunc
التهديد   N   nopunc


Figure 4.3.1: sample of the text structure after processing. 


The file produced in the previous step, which contains the words, POS tags, and
punctuation tags, is considered the gold file. The gold file is passed to the ML
algorithm. The ML algorithm ignores the last column (punctuation marks) and tags
the text with punctuation marks. The tagged text is then compared with the original
file (gold file) to produce a confusion matrix with four values (i.e. TPs, TNs, FPs and
FNs), from which the accuracy of the ML model is measured using the performance
evaluation metrics.
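A minimal sketch of how such a three-column gold file might be read is given below. It assumes whitespace-separated columns in the order word, POS tag, punctuation tag; the function names parse_columns and strip_punctuation_column are hypothetical and the file handling used in this research may differ.

```python
def parse_columns(lines):
    """Turn 'word POS punc' lines into (word, pos, punc) tuples.

    Lines that do not have exactly three columns are skipped.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def strip_punctuation_column(rows):
    """Drop the gold punctuation tag so the tagger only sees (word, pos)."""
    return [(word, pos) for word, pos, _ in rows]
```

The gold punctuation tags kept in the parsed rows can later be compared with the tags predicted for the stripped rows.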


4.4 Summary of the Experiments 


To conclude, two tasks, i.e. punctuation mark prediction (nine-class problem) and
sentence terminal prediction (two-class problem), are investigated using three ML
algorithms (i.e. the N-gram, HMM, and CRF algorithms). Two categories of POS tags
(i.e. 3-POS tags and 10-POS tags) are tested for each task. The ML algorithms
are trained and tested using the BAQ Corpus for both tasks. The training dataset
always occupies 90% and the test dataset always occupies 10% of the BAQ Corpus.
The performance of the ML algorithms is measured using performance evaluation
metrics (i.e. Accuracy and BCR). An MSA text is tagged with punctuation marks using
the ML algorithm that has the best performance evaluation. For this purpose the ML
algorithm is trained on the whole BAQ Corpus to predict punctuation marks for the MSA
text. Therefore, the total number of conducted experiments is thirteen. Table 4.4.1
gives an overview of the design of all the experiments.


Table 4.4.1: Summary of the experiments.


No.   Nine/Two class problem   ML Algorithm   POS Category   Train / BAQ   Test / BAQ or MSA
1.    Nine class               N-gram         3-POS          90%           10%
2.    Nine class               N-gram         10-POS         90%           10%
3.    Nine class               HMM            3-POS          90%           10%
4.    Nine class               HMM            10-POS         90%           10%
5.    Nine class               CRF            3-POS          90%           10%
6.    Nine class               CRF            10-POS         90%           10%
7.    Two class                N-gram         3-POS          90%           10%
8.    Two class                N-gram         10-POS         90%           10%
9.    Two class                HMM            3-POS          90%           10%
10.   Two class                HMM            10-POS         90%           10%
11.   Two class                CRF            3-POS          90%           10%
12.   Two class                CRF            10-POS         90%           10%
13.   Nine class               CRF            3-POS          100% BAQ      100% MSA
Chapter 5


Experiments Results and Evaluation Discussion 


5.1 Introduction 


After preparing the BAQ Corpus, breaking the corpus into sentences, splitting it into
training and testing datasets, and designing the NLTK code for training and testing the
ML algorithms on the corpus, the experiments of punctuation annotation and sentence
terminal prediction were conducted.


Two types of experiments are presented in this chapter: (i) punctuation mark prediction
and (ii) sentence terminal prediction. Each of these experiments was conducted using
three ML algorithms (i.e. N-gram, HMM, and CRF). The ML algorithms use the word
and POS tags as features for predicting punctuation marks and sentence terminals.
Different evaluation metrics were used to compare the results of the experiments. The
algorithm that has the best performance results is used to punctuate an MSA text.


This chapter presents and discusses the results of the experiments. Section 5.2 discusses
the evaluation metrics used for measuring the performance of the ML algorithms used
in this research. Section 5.3 presents the problem of skewed data and the suitable
evaluation metrics. Section 5.4 describes the format of the presentation of results.
Section 5.5 presents the results of the conducted experiments. Finally, Section 5.6
presents a discussion of the obtained results.
5.2 Evaluation Metrics


The evaluation process is "concerned with measuring the differences between the expected
and the final results ... such as the evaluation value is restricted between 0 and 1"
(Nakache et al., 2005).


Many machine learning algorithms used in NLP applications are used to solve
classification problems. Classification is the process of identifying the category of an
instance, where the category comes from a set of predefined categories and a classifier
is trained on a dataset in which the category of each instance is already known. The
algorithms used in this research are considered classification algorithms. Two types of
classification tasks are conducted in this research: (i) a nine-class problem (punctuation
mark prediction task) and (ii) a two-class problem (sentence terminal prediction task).


The performance of classification algorithms is measured using evaluation metrics such
as Precision, Recall, F-score, Accuracy Rate and Balanced Accuracy Rate (BCR). We
employ each of these measurements in order to compute the success rate of the
algorithms used in this research.


Precision is defined as the ratio of the retrieved instances that are relevant to the query,
while Recall is the ratio of the relevant instances that are retrieved. For example, suppose
that our algorithm correctly predicted 5 full-stop marks in a document that contains 50
sentences, 10 of which actually end with a full stop. The recall of our algorithm is the
ratio of correct predictions of the full-stop mark to the number of full stops actually in
the document (5/10). Precision is the ratio of correct predictions to the total number of
full stops that the algorithm predicted; if, for instance, the algorithm had predicted 8
full stops in total, the precision would be 5/8.
One of the tools used to visualize the performance of an algorithm is the confusion
matrix. The columns of the confusion matrix present the predicted observations, while
the rows present the actual (gold) observations. Figure 5.2.1 and Figure 5.2.2 show the
confusion matrices for the two classification problems.


Precision, Recall, and other evaluation metrics are defined on the basis of the
confusion matrix and the numbers of True Positives, False Positives, True Negatives
and False Negatives. For the nine-class problem, as shown in Figure 5.2.1, the four terms
are defined as follows:

1. True Positive (TP): the number of punctuation marks that are correctly
predicted, e.g. a full stop is correctly predicted as a full stop and a comma is
correctly predicted as a comma.

2. False Positive (FP): the number of punctuation and non-punctuation marks
that are incorrectly predicted as a punctuation mark, e.g. a comma is predicted as a
full stop and a non-punctuation mark (nopunc) is predicted as an exclamation mark.

3. True Negative (TN): the number of non-punctuation marks (nopunc) that are
correctly predicted as non-punctuation marks (nopunc).

4. False Negative (FN): the number of punctuation marks that are predicted as
non-punctuation marks, e.g. a comma is predicted as a non-punctuation mark
(nopunc).


For the two-class problem, as shown in Figure 5.2.2, the four terms are defined as
follows:


1. True Positive (TP): the number of sentence breaks predicted correctly as
sentence breaks.

2. False Positive (FP): the number of non-breaks predicted as sentence breaks.

3. True Negative (TN): the number of non-breaks predicted correctly as non-breaks.

4. False Negative (FN): the number of sentence breaks predicted as non-breaks.
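The definitions above can be turned into a short counting routine. The following is a minimal sketch, not the evaluation script used in this research; the function name confusion_counts and the argument names are hypothetical. It treats one tag (nopunc by default) as the negative class and every other tag as a positive class, so the same function covers the nine-class and the two-class problem.

```python
def confusion_counts(gold, predicted, negative="nopunc"):
    """Count TP, TN, FP and FN for two aligned tag sequences."""
    tp = tn = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == negative:
            if g == negative:
                tn += 1   # nopunc correctly predicted as nopunc
            else:
                fn += 1   # a punctuation mark predicted as nopunc
        elif g == p:
            tp += 1       # a punctuation mark predicted correctly
        else:
            fp += 1       # nopunc or another mark predicted as this mark
    return tp, tn, fp, fn
```

For the two-class problem the negative label would be the non-terminal tag instead of nopunc.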




Figure 5.2.1: TPs, TNs, FPs and FNs values for a confusion matrix of nine-class 
problem. 


Figure 5.2.2: TPs, TNs, FPs and FNs values for a confusion matrix of the two-class 
problem. 
Based on the previous definitions, the performance measurements are defined as in the
following equations:


$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{3}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \tag{4}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{5}$$

$$\text{Balanced Accuracy Rate (BCR)} = \frac{0.5 \cdot TP}{TP + FN} + \frac{0.5 \cdot TN}{TN + FP} \tag{6}$$
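Equations 1 to 6 translate directly into code. The sketch below is illustrative only; the function name evaluation_metrics is hypothetical, and the convention of returning zero when a denominator is zero is an assumption chosen to agree with the zero baseline values for recall, precision and F-score reported in this chapter.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute the six metrics of equations (1)-(6) from the four counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0             # eq. (1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                # eq. (2)
    f_denominator = 2 * tp + fp + fn
    f_score = 2 * tp / f_denominator if f_denominator else 0.0   # eq. (3)
    specificity = tn / (fp + tn) if (fp + tn) else 0.0           # eq. (4)
    accuracy = (tp + tn) / (tp + fn + fp + tn)                   # eq. (5)
    bcr = 0.5 * recall + 0.5 * specificity                       # eq. (6)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "specificity": specificity,
        "accuracy": accuracy,
        "bcr": bcr,
    }
```

For instance, the counts TP = 536, TN = 6679, FP = 359 and FN = 949 give an accuracy of about 0.847 and a BCR of about 0.655, while a baseline that predicts nopunc everywhere (TP = 0, FP = 0) obtains a BCR of exactly 0.5 however large its accuracy is.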


The Precision, Recall and F-score measures presented in equations 1, 2 and 3 respectively
focus only on positive predictions and omit any information about negative predictions
(Powers, 2011). This means that these metrics will yield misleading results if the data is
unbalanced (the sizes of the classes vary greatly, i.e. nopunc is over-represented
in the dataset). We expect low precision, recall, and F-score rates because the true
negative cases are the majority of the real and the predicted cases (observations) in our
experiments. Similarly, the Specificity measure (equation 4) omits information about the
true positive cases and concentrates only on the true negative cases; we expect a high
Specificity rate. The problem of imbalanced data is discussed in more detail in
Section 5.3.


Based on that, we need performance measurements that take into account both the correct
positive and the correct negative predictions. Such performance measurements are
Accuracy and the BCR (equations 5 and 6). In this research we concentrate on the
values of these two measurements.
5.3 Skewed Data


Skewed data is the problem of imbalanced data distribution: "a type of classification
problem, where some classes are highly underrepresented compared to other classes"
(Wang and Yao, 2012).


Skewed data causes problems for machine learning algorithms, i.e. it limits their
ability to predict the under-represented data. Classes in skewed data are of two types:
(i) multi-minority classes, i.e. classes which are represented by a small proportion of the
data, and (ii) multi-majority classes, i.e. classes which are represented by a large
proportion of the data. Misclassification of the multi-minority classes could harm the
general performance of machine learning algorithms. Two-class imbalanced data and
multi-class imbalanced data are two types of the skewed data problem (Wang and Yao,
2012).


Many applications experience the problem of skewed data. In the medical field, for
instance, symptoms are used to recognize patients' diseases; hence, misclassifying these
symptoms could harm the prediction of rare diseases. Furthermore, skewed data
problems in banking systems could mislead the prediction of fraud (Longadge and
Dongre, 2013).


The data in the current research is also considered skewed. Two types of experiments have
been conducted here: punctuation mark prediction (multi-class problem) and sentence
break prediction (two-class problem). In the multi-class problem the dataset is
distributed over nine classes, but the non-punctuation class (nopunc) accounts for the
largest proportion of the data in the BAQ Corpus. In the two-class problem the dataset
is distributed over two classes, sentence terminal and non-terminal, where the
non-terminal (non) class also accounts for the largest proportion of the data in the BAQ
Corpus.
The BAQ Corpus is used in this research for training and testing the three machine
learning algorithms. To facilitate measurement of the performance of the machine
learning algorithms, the predicted data are grouped into four categories: TPs, FPs, TNs,
and FNs. Several performance metrics are used to measure the performance of these
algorithms. Some of these metrics, such as Precision, Recall, and F-score, omit TNs
from their calculation (see Section 5.2). Therefore, we need to use metrics that can
deal with the problem of imbalanced data distribution without omitting any of the
important information about the results of classification. The two measurements we
use are Accuracy and BCR.


5.4 Format of Result Presentation 


This section describes the structure of the table of results. This table summarizes the
results of the cross-validation experiments and the scores of 6 performance metrics. The
same table format is used to summarize the results of testing the three ML algorithms.
The table consists of 16 columns and 24 rows. The 1st column lists the experiment
numbers. A Baseline row in this column corresponds to tagging every word in the testing
part as not followed by a punctuation mark. The 2nd column presents the
number of observations, i.e. the number of words in the testing dataset. The 3rd column
states the number of punctuation marks or sentence terminals in the gold datasets, i.e. the
number of words that are followed by punctuation marks (for the punctuation mark
prediction experiments) or the number of words followed by a sentence terminal (for
the sentence terminal prediction experiments). The 4th column shows the number of
non-punctuation marks (nopunc) in the gold datasets, i.e. the number of words that are not
followed by any punctuation mark (for the punctuation mark prediction experiments),
or the number of non-terminals in the gold datasets (for the sentence terminal
prediction experiments). The 5th and the 6th columns present the numbers of
predicted punctuation and non-punctuation marks respectively. The 7th, 8th, 9th and 10th
columns present the values of TP, TN, FP and FN respectively. Note that the values of TN
and FN in the baseline rows are the same as the numbers of non-punctuation and
punctuation marks in the gold datasets respectively. These values were added to measure
the baseline accuracy for each experiment. The baseline accuracy for each experiment is
given in the 11th column. The boldface values in the 11th column are the accuracy
rates of the ML algorithm in each experiment. The 12th, 13th and 14th columns list the
values obtained for Recall, Precision and F-score in each experiment. For these three
columns, the baseline recall, precision and F-score are equal to zero. The 15th column
lists the Specificity value in each experiment. The baseline Specificity equals one. The
16th column, the last one, presents the BCR value in each experiment; the baseline BCR
value equals 0.5. The bottom two rows in the table present the averages of the
performance metrics and the average baseline accuracy and BCR. Table 5.4.1 presents an
example of the HMM model results for punctuation mark prediction formatted in this way.
Table 5.4.1: Example of results table for testing HMM on the BAQ Corpus.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 895 | 7628 | 536 | 6679 | 359 | 949 | 0.847 | 0.361 | 0.599 | 0.450 | 0.949 | 0.655
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 845 | 8533 | 491 | 7280 | 354 | 1253 | 0.829 | 0.282 | 0.581 | 0.379 | 0.954 | 0.618
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 1107 | 8703 | 642 | 7670 | 465 | 1033 | 0.847 | 0.383 | 0.580 | 0.462 | 0.943 | 0.663
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 874 | 7643 | 546 | 6671 | 328 | 972 | 0.847 | 0.360 | 0.625 | 0.457 | 0.953 | 0.656
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 753 | 7297 | 433 | 6365 | 320 | 932 | 0.844 | 0.317 | 0.575 | 0.409 | 0.952 | 0.635
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 874 | 5863 | 556 | 5214 | 318 | 649 | 0.856 | 0.461 | 0.636 | 0.535 | 0.943 | 0.702
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 833 | 6296 | 506 | 5595 | 327 | 701 | 0.856 | 0.419 | 0.607 | 0.496 | 0.945 | 0.682
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 794 | 6369 | 511 | 5606 | 283 | 763 | 0.854 | 0.401 | 0.644 | 0.494 | 0.952 | 0.677
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 632 | 6287 | 353 | 5385 | 279 | 902 | 0.829 | 0.281 | 0.559 | 0.374 | 0.951 | 0.616
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 462 | 4742 | 216 | 3869 | 246 | 873 | 0.785 | 0.198 | 0.468 | 0.279 | 0.940 | 0.569
Performance Metrics Average | | | | | | | | | | 0.839 | 0.346 | 0.587 | 0.433 | 0.948 | 0.647
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
5.5 Experiments Results


After preparing the BAQ Corpus and designing the ML algorithms, the proposed
experiments were conducted. This section presents the results for the two classification
tasks: (i) prediction of punctuation marks and (ii) prediction of sentence terminals.


5.5.1 Prediction of Punctuation Marks (Nine Class Problem)


This section gives the results of the punctuation mark prediction experiments for each of
the ML algorithms, i.e. N-gram, HMM and CRF respectively, for both POS-tag categories
(i.e. 3 POS and 10 POS).


5.5.1.1 N-gram Algorithm 


The N-gram model is one of the probabilistic language models that are concerned with
predicting the next item in a sequence of observations. Prediction with the N-gram
model is performed by computing the probability of a sequence of observations and
selecting the sequence with the highest probability. The N-gram model has many
applications in Natural Language Processing, such as segmentation, Part Of Speech
tagging, etc. More detail about this model can be found in Section 3.4.1. The N-gram
model was trained and tested on the BAQ Corpus for the prediction of punctuation
marks. The results obtained for both categories of POS tags are described in the next
two subsections.
5.5.1.1.1 Results of N-gram Punctuation Marks Prediction using a Three-POS Tag Set


After training and testing the N-gram tagger on the BAQ Corpus in the cross-validation
experiments, Table 5.5.1.1.1.1 was obtained.


The N-gram algorithm in these experiments uses the word and POS tag (i.e. 3 POS
[noun, verb, and particle]) to predict the punctuation mark or nopunc after each word in
the test dataset. The average accuracy rate for the 10 experiments is 83.9%, a 3.6%
increment above the average baseline accuracy of 80.3%. We also gain an increment of
14.7% in average BCR, i.e. a score of 64.7% against the BCR baseline of 50%.


As explained in Section 5.2, the recall, precision, and F-score yield misleading results
when the data is imbalanced. The results were very low because these metrics take into
account only correct positive predictions. Positive cases (i.e. punctuation marks)
represent less than 20% of the sample, compared to more than 80% negative cases (i.e.
nopunc), since the majority of words in a standard piece of text are followed by no
punctuation mark whatsoever. The average recall is 34.6%, the average precision 58.7%,
and the average F-score 43.3%. On the other hand, because specificity represents
correct predictions of negative cases (i.e. how many nopunc cases were predicted as
nopunc), the obtained specificity score was 94.8%.


Because our dataset is skewed towards nopunc, we rely, in our evaluation of the ML
algorithms' performance, on the Accuracy rate and BCR, as they do take correct negative
predictions into account. For each experiment, we listed the positive and negative
prediction values. TPs and TNs represent the correct predictions of punctuation marks
and nopunc respectively. FPs and FNs represent false predictions of punctuation marks
and nopunc respectively. TPs are a primitive measure of an algorithm's performance.
5.5.1.1.2 Results of N-gram Punctuation Marks Prediction using a Ten-POS Tag Set


Table 5.5.1.1.1.2 below displays the results of the N-gram model's punctuation prediction
in 10 experiments. To predict a punctuation mark or nopunc after each word in the
dataset in each experiment, the N-gram model uses the word and POS tag (i.e. 10 POS
[Noun, Verb, Nominal, etc.]) as features. The average accuracy rate obtained in the 10
experiments is 83.3%, a 3.3% increment above the average baseline accuracy of
80.3%. Similarly, the average BCR score was 65%, an increment of 15% above
the average BCR baseline of 50%.


As explained in the previous section and in Section 5.2, the recall, precision, and F-score
measures are misleading because the dataset is imbalanced. The average recall
score is 36.5%, the average precision score 53.8%, and the average F-score 43.3%. The
specificity score, on the other hand, is 93.4%. We concentrate on the accuracy and
BCR measures when comparing results.


Figure 5.5.1.1.1 below shows a sample of text from the Qur'an where punctuation
marks were automatically added using the N-gram algorithm. In this example, the
N-gram algorithm used 3 POS tags and the word as features for predicting the
punctuation marks. Correspondences between the original Qur'anic punctuation and the
automatic punctuation are highlighted in green. Cases of discrepancy are highlighted in
red.
[Figure panels: "Original punctuated Quran text" and "Automatic punctuation of the Quran text"]


Figure 5.5.1.1.1: Automatic prediction of punctuation marks using the N-gram model 
with 3-POS tags. 
Table 5.5.1.1.1.1: Results of punctuation marks prediction using the N-gram algorithm with 3-POS tags.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 895 | 7628 | 536 | 6679 | 359 | 949 | 0.847 | 0.361 | 0.599 | 0.450 | 0.949 | 0.655
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 845 | 8533 | 491 | 7280 | 354 | 1253 | 0.829 | 0.282 | 0.581 | 0.379 | 0.954 | 0.618
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 1107 | 8703 | 642 | 7670 | 465 | 1033 | 0.847 | 0.383 | 0.580 | 0.462 | 0.943 | 0.663
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 874 | 7643 | 546 | 6671 | 328 | 972 | 0.847 | 0.360 | 0.625 | 0.457 | 0.953 | 0.656
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 753 | 7297 | 433 | 6365 | 320 | 932 | 0.844 | 0.317 | 0.575 | 0.409 | 0.952 | 0.635
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 874 | 5863 | 556 | 5214 | 318 | 649 | 0.856 | 0.461 | 0.636 | 0.535 | 0.943 | 0.702
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 833 | 6296 | 506 | 5595 | 327 | 701 | 0.856 | 0.419 | 0.607 | 0.496 | 0.945 | 0.682
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 794 | 6369 | 511 | 5606 | 283 | 763 | 0.854 | 0.401 | 0.644 | 0.494 | 0.952 | 0.677
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 632 | 6287 | 353 | 5385 | 279 | 902 | 0.829 | 0.281 | 0.559 | 0.374 | 0.951 | 0.616
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 462 | 4742 | 216 | 3869 | 246 | 873 | 0.785 | 0.198 | 0.468 | 0.279 | 0.940 | 0.569

Performance Metrics Average: Accuracy 0.839 | Recall 0.346 | Precision 0.587 | F-score 0.433 | Specificity 0.948 | BCR 0.647
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
Table 5.5.1.1.1.2: Results of punctuation annotation using the N-gram algorithm with 10-POS tags.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 1021 | 7502 | 553 | 6600 | 468 | 902 | 0.839 | 0.380 | 0.542 | 0.447 | 0.934 | 0.657
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 990 | 8388 | 514 | 7193 | 476 | 1195 | 0.822 | 0.301 | 0.519 | 0.381 | 0.938 | 0.619
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 1245 | 8565 | 685 | 7587 | 560 | 978 | 0.843 | 0.412 | 0.550 | 0.471 | 0.931 | 0.672
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 967 | 7550 | 567 | 6618 | 400 | 932 | 0.844 | 0.378 | 0.586 | 0.460 | 0.943 | 0.661
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 883 | 7167 | 453 | 6270 | 430 | 897 | 0.835 | 0.336 | 0.513 | 0.406 | 0.936 | 0.636
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 968 | 5769 | 565 | 5144 | 403 | 625 | 0.847 | 0.475 | 0.584 | 0.524 | 0.927 | 0.701
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 941 | 6188 | 530 | 5534 | 411 | 654 | 0.851 | 0.448 | 0.563 | 0.499 | 0.931 | 0.689
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 889 | 6274 | 542 | 5554 | 347 | 720 | 0.851 | 0.429 | 0.610 | 0.504 | 0.941 | 0.685
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 743 | 6176 | 368 | 5319 | 375 | 857 | 0.822 | 0.300 | 0.495 | 0.374 | 0.934 | 0.617
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 506 | 4698 | 209 | 3821 | 297 | 877 | 0.774 | 0.192 | 0.413 | 0.263 | 0.928 | 0.560

Performance Metrics Average: Accuracy 0.833 | Recall 0.365 | Precision 0.538 | F-score 0.433 | Specificity 0.934 | BCR 0.650
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
5.5.1.2 HMM Algorithm


The Hidden Markov Model (HMM) is a probabilistic sequence classifier. It is used to
compute the probability distribution over a sequence of observations and to assign
the sequence of labels that has the highest probability of occurrence. The HMM has
many applications in Natural Language Processing (NLP), such as part-of-speech
tagging, sentence segmentation, and information extraction. More detail about this
model is given in section 3.4.2.
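The label assignment described above is commonly carried out with Viterbi decoding. The sketch below is a minimal illustration under stated assumptions, not the implementation used in these experiments: the function name `viterbi`, the two labels, and all probabilities are made-up values rather than estimates from the BAQ Corpus, and the observations are POS tags only, whereas the experiments use the word together with its POS tag.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable label sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (log-probability of the best path ending in s at step t,
    #               previous state on that path)
    first = observations[0]
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][first]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][s]), p) for p in states
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical toy parameters (all non-zero, so math.log is defined).
STATES = ("nopunc", "punc")
START = {"nopunc": 0.8, "punc": 0.2}
TRANS = {"nopunc": {"nopunc": 0.8, "punc": 0.2},
         "punc": {"nopunc": 0.9, "punc": 0.1}}
EMIT = {"nopunc": {"noun": 0.4, "verb": 0.3, "particle": 0.3},
        "punc": {"noun": 0.6, "verb": 0.3, "particle": 0.1}}

labels = viterbi(["particle", "noun", "verb", "noun"], STATES, START, TRANS, EMIT)
print(labels)
```

Working in log-probabilities avoids numerical underflow on long sequences. In practice the parameters would be estimated from annotated training data and smoothed so that unseen events do not receive zero probability.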


The next two sections present the results of punctuation mark prediction using the
BAQ Corpus for training and testing the HMM. Both categories of POS tag (i.e. 3 POS
and 10 POS) are used together with word sequences as features in the experiments.


5.5.1.2.1 Results of HMM Punctuation Marks Prediction using a Three-POS Tag Set


Table 5.5.1.2.1.1 presents the results of predicting the punctuation marks in the BAQ
Corpus using the HMM model in 10 experiments. The word and POS tag (i.e. 3 POS
[noun, verb, and particle]) are used by the HMM model for predicting a punctuation
mark or nopunc after each word in the dataset. The average accuracy is 84.1%, a 3.8%
increment above the average baseline accuracy of 80.3% in the 10 experiments.
The average BCR is 63.9%, an increment of 13.9% above the average BCR
baseline of 50%.
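The averages and increments quoted above can be reproduced from the per-experiment rows. The following Python sketch does this for the accuracy column, with the values copied from Table 5.5.1.2.1.1. The variable names are illustrative only.

```python
from statistics import mean

# Accuracy of the HMM (3-POS tags) in experiments 1-10, from Table 5.5.1.2.1.1.
model_accuracy = [0.847, 0.830, 0.849, 0.850, 0.845,
                  0.855, 0.856, 0.857, 0.830, 0.788]
# Accuracy of the all-nopunc baseline on the same ten test sets.
baseline_accuracy = [0.810, 0.797, 0.811, 0.805, 0.815,
                     0.800, 0.811, 0.807, 0.805, 0.768]

avg_model = mean(model_accuracy)
avg_baseline = mean(baseline_accuracy)
increment = avg_model - avg_baseline

print(f"average accuracy: {avg_model:.3f}")
print(f"average baseline: {avg_baseline:.3f}")
print(f"increment:        {increment:.3f}")
```

The increment is a difference between two percentages, so it is best read as percentage points.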


On the other hand, the average recall is 32%, the average precision is 62%,
and the average F-score is 41.9%. Furthermore, the average specificity is
95.8%. Since the dataset used in this research is imbalanced (as explained in section
5.2), the recall, precision, F-score, and specificity measurements are misleading;
therefore, the measurements used for comparing experiment results are the accuracy
and BCR measurements.


5.5.1.2.2 Results of HMM Punctuation Marks Prediction using a Ten-POS Tag Set 


This subsection presents the results of punctuation mark prediction using the ten-POS
tag feature for training and testing the HMM algorithm on the BAQ Corpus.


Table 5.5.1.2.2.1 shows the results of this punctuation mark prediction in the 10
experiments. The HMM model uses the word and POS tag (i.e. 10 POS [noun, adverb,
pronoun, etc.]) features for predicting whether each word is to be followed by a
punctuation mark or by nopunc. The average accuracy rate for the 10 experiments is
84.1%, a 3.8% increment above the average baseline accuracy of 80.3%. We also gain an
increment of 13.7% in average BCR value above the 50% baseline, the attained BCR rate
being 63.7%.


The average rates of recall, precision, and F-score are respectively 31.6%, 62.4%, and
41.7%. Specificity, on the other hand, averages 95.9%. As explained in earlier sections,
none of these metrics will be commented on here because they are essentially
misleading when the data set is skewed. We will use instead the accuracy and BCR
measurements for comparing experiment results.


Figure 5.5.1.2.1 illustrates HMM's automatic punctuation prediction in a piece of
Qur'anic text. The HMM model uses the word and the 3-POS tag set as features for
predicting punctuation marks. Agreement between the original Qur'anic punctuation
and the automatic punctuation is highlighted in green. Cases of disagreement are
highlighted in red.


[Figure: Qur'anic sample from Surat Yunus shown in two panels, "Original punctuated
Quran text" and "Automatic punctuation of the Quran text".]


Figure 5.5.1.2.1: HMM automatic punctuation mark prediction in a Qur'anic text using
a 3-POS tag set.
Table 5.5.1.2.1.1: HMM punctuation mark prediction using a 3-POS tag set.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 786 | 7737 | 494 | 6721 | 292 | 1016 | 0.847 | 0.327 | 0.628 | 0.430 | 0.958 | 0.643
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 740 | 8638 | 458 | 7326 | 282 | 1312 | 0.830 | 0.259 | 0.619 | 0.365 | 0.963 | 0.611
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 954 | 8856 | 587 | 7746 | 367 | 1110 | 0.849 | 0.346 | 0.615 | 0.443 | 0.955 | 0.650
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 773 | 7744 | 523 | 6719 | 250 | 1025 | 0.850 | 0.338 | 0.677 | 0.451 | 0.964 | 0.651
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 675 | 7375 | 399 | 6401 | 276 | 974 | 0.845 | 0.291 | 0.591 | 0.390 | 0.959 | 0.625
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 788 | 5949 | 517 | 5243 | 271 | 706 | 0.855 | 0.423 | 0.656 | 0.514 | 0.951 | 0.687
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 756 | 6373 | 478 | 5623 | 278 | 750 | 0.856 | 0.389 | 0.632 | 0.482 | 0.953 | 0.671
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 736 | 6427 | 501 | 5638 | 235 | 789 | 0.857 | 0.388 | 0.681 | 0.495 | 0.960 | 0.674
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 569 | 6350 | 330 | 5411 | 239 | 939 | 0.830 | 0.260 | 0.580 | 0.359 | 0.958 | 0.609
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 383 | 4821 | 200 | 3900 | 183 | 921 | 0.788 | 0.178 | 0.522 | 0.266 | 0.955 | 0.567

Performance Metrics Average: Accuracy 0.841 | Recall 0.320 | Precision 0.620 | F-score 0.419 | Specificity 0.958 | BCR 0.639
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
Table 5.5.1.2.2.1: HMM punctuation mark prediction using a 10-POS tag set.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 779 | 7744 | 491 | 6725 | 288 | 1019 | 0.847 | 0.325 | 0.630 | 0.429 | 0.959 | 0.642
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 727 | 8651 | 453 | 7328 | 274 | 1323 | 0.830 | 0.255 | 0.623 | 0.362 | 0.964 | 0.610
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 961 | 8849 | 593 | 7744 | 368 | 1105 | 0.850 | 0.349 | 0.617 | 0.446 | 0.955 | 0.652
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 769 | 7748 | 521 | 6720 | 248 | 1028 | 0.850 | 0.336 | 0.678 | 0.450 | 0.964 | 0.650
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 672 | 7378 | 392 | 6399 | 280 | 979 | 0.844 | 0.286 | 0.583 | 0.384 | 0.958 | 0.622
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 784 | 5953 | 519 | 5246 | 265 | 707 | 0.856 | 0.423 | 0.662 | 0.516 | 0.952 | 0.688
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 728 | 6401 | 471 | 5640 | 257 | 761 | 0.857 | 0.382 | 0.647 | 0.481 | 0.956 | 0.669
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 723 | 6440 | 494 | 5642 | 229 | 798 | 0.857 | 0.382 | 0.683 | 0.490 | 0.961 | 0.672
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 552 | 6367 | 324 | 5416 | 228 | 951 | 0.830 | 0.254 | 0.587 | 0.355 | 0.960 | 0.607
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 353 | 4851 | 188 | 3908 | 165 | 943 | 0.787 | 0.166 | 0.533 | 0.253 | 0.959 | 0.563

Performance Metrics Average: Accuracy 0.841 | Recall 0.316 | Precision 0.624 | F-score 0.417 | Specificity 0.959 | BCR 0.637
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
5.5.1.3 CRF Algorithm


The Conditional Random Fields (CRF) model is a statistical modeling method used for
building models for structured prediction. There are many types of CRF models, such
as linear-chain CRF, dynamic CRF, and skip-chain models, and they have many
applications in NLP, such as sentence phrasing, named entity recognition, and POS
tagging. More details about this model are given in section 3.4.3.


The next two sections discuss the experimental results of punctuation mark prediction.
The CRF model was trained and tested on the BAQ Corpus in ten experiments for
both categories of POS tags (i.e. 3 POS tags and 10 POS tags).


5.5.1.3.1 Results of CRF Punctuation Mark Prediction using a Three-POS Tag Set 


This subsection presents the results of training and testing the CRF model for the ten
experiments using the BAQ Corpus with 3-POS tags selected as a feature for prediction.


The CRF model uses the word and its POS tags (i.e. 3 POS [noun, verb, and particle])
as features to predict punctuation marks or nopunc after each word in the dataset for
each of the ten experiments. Table 5.5.1.3.1.1 shows the results of punctuation mark
prediction using the CRF model. The average BCR is 86.4%, an increment of
36.4% over the average BCR baseline of 50%. Also, the average accuracy is 93.4%,
an increment of 13.1% over the average accuracy baseline of 80.3%.


On the other hand, the average recall is 75.5%, the average precision is 86.4%,
and the average F-score is 80.6%. Furthermore, the average specificity is
97.4%. As we explained in section 5.2, these measures yield misleading results when
the data is imbalanced. Therefore, we will depend on the accuracy and BCR measures
to compare the results of the experiments.


5.5.1.3.2 Results of CRF Punctuation Mark Prediction using a Ten-POS Tag Set 


This subsection presents the results of training and testing the CRF model on the
BAQ Corpus for the ten experiments, with the ten POS tags selected as a feature for
prediction.


Table 5.5.1.3.2.1 presents the results of punctuation mark prediction using the CRF
model for ten experiments. The CRF model in these experiments uses the word and
POS tag (i.e. 10 POS [nominal, pronoun, verb, etc.]) to predict punctuation marks or
nopunc after each word in the dataset. The average accuracy is 92.2%, an
increment of 11.9% over the average baseline accuracy of 80.3%. We also gained an
increment of 35.8% in average BCR, which reached 85.8%, over the average BCR baseline
of 50%.


As we explained in section 5.2, the precision, recall, F-score, and specificity yield
misleading results when the data is imbalanced. The average recall is 76.0%, the
average precision is 79.1%, and the average F-score is 77.4%. On the other hand, the
average specificity is 95.7%.


The results of the CRF algorithm with 3-POS tags for the prediction of punctuation
marks show its superiority over the other algorithms (i.e. N-gram and HMM). This
superiority of the CRF model is due to its feature selection characteristic, which
enables the model to use a varying number of earlier and later features (i.e. words
and their corresponding POS tags), which helps it to investigate the tested dataset.
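To make the idea of earlier and later features concrete, the following Python sketch builds a feature dictionary for one token from a window of neighbouring words and POS tags. It is only an illustration under stated assumptions: the function name `token_features`, the feature names, and the window size are hypothetical, and the feature template actually used in the experiments is not reproduced here. The tokens `w1`, `w2`, `w3` are placeholders for Arabic words.

```python
def token_features(words, pos_tags, i, window=2):
    """Build a feature dict for position i from neighbouring words and POS tags."""
    features = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(words):
            # Word and POS tag at a relative position, e.g. word[-1], pos[+2].
            features[f"word[{offset:+d}]"] = words[j]
            features[f"pos[{offset:+d}]"] = pos_tags[j]
        else:
            # Mark positions that fall outside the sequence.
            features[f"pad[{offset:+d}]"] = True
    return features


# Placeholder tokens with 3-POS tags (noun, verb, particle).
words = ["w1", "w2", "w3"]
pos_tags = ["particle", "noun", "verb"]
sequence_features = [token_features(words, pos_tags, i, window=1)
                     for i in range(len(words))]
print(sequence_features[0])
```

A list of such dictionaries, one per token, is the kind of input that linear-chain CRF toolkits typically accept. Widening `window` exposes more context to the model at the cost of a larger feature space.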
Figure 5.5.1.3.1 presents an example of Qur'anic text that was automatically
punctuated using the CRF model with 3 POS tags. Agreements between the original
Qur'an text and the automatically punctuated text are highlighted in green. Cases of
disagreement are highlighted in red.


[Figure: Qur'anic sample from Surat Al-A'raf shown in two panels, "Original punctuated
Quran text" and "Automatic punctuation of the Quran text".]


Figure 5.5.1.3.1: Automatic prediction of punctuation marks for a Qur'an text using the
CRF model with the 3-POS tags.


Table 5.5.1.3.1.1: Punctuation marks prediction using CRF algorithm with 3-POS tag set.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 1276 | 7247 | 1109 | 6859 | 167 | 388 | 0.935 | 0.741 | 0.869 | 0.800 | 0.976 | 0.859
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 1519 | 7859 | 1242 | 7383 | 277 | 476 | 0.920 | 0.723 | 0.818 | 0.767 | 0.964 | 0.843
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 1431 | 8379 | 1212 | 7896 | 219 | 483 | 0.928 | 0.715 | 0.847 | 0.775 | 0.973 | 0.844
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 1320 | 7197 | 1146 | 6800 | 174 | 397 | 0.933 | 0.743 | 0.868 | 0.801 | 0.975 | 0.859
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 1179 | 6871 | 1032 | 6527 | 147 | 344 | 0.939 | 0.750 | 0.875 | 0.808 | 0.978 | 0.864
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 1121 | 5616 | 966 | 5359 | 155 | 7 | 0.939 | 0.790 | 0.862 | 0.824 | 0.972 | 0.881
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 1072 | 6057 | 939 | 5763 | 133 | 294 | 0.940 | 0.762 | 0.876 | 0.815 | 0.977 | 0.869
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 1136 | 6027 | 995 | 5749 | 141 | 8 | 0.942 | 0.782 | 0.876 | 0.826 | 0.976 | 0.879
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 955 | 5964 | 952 | 5543 | 116 | 308 | 0.939 | 0.756 | 0.891 | 0.818 | 0.980 | 0.868
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 995 | 4209 | 854 | 3983 | 141 | 226 | 0.929 | 0.791 | 0.858 | 0.823 | 0.966 | 0.878

Performance Metrics Average: Accuracy 0.934 | Recall 0.755 | Precision 0.864 | F-score 0.806 | Specificity 0.974 | BCR 0.864
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
Table 5.5.1.3.2.1: Punctuation marks prediction using CRF algorithm with 10-POS tag set.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 1619 | 6904 | 0 | 8523 | 0 | 6904 | 0 | 1619 | 0.810 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 1619 | 6904 | 1377 | 7146 | 1107 | 6781 | 270 | 365 | 0.925 | 0.752 | 0.804 | 0.777 | 0.962 | 0.857
Baseline | 9378 | 1900 | 7478 | 0 | 9378 | 0 | 7478 | 0 | 1900 | 0.797 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 1900 | 7478 | 1385 | 7993 | 1076 | 7374 | 309 | 9 | 0.901 | 0.635 | 0.777 | 0.699 | 0.960 | 0.797
Baseline | 9810 | 1853 | 7957 | 0 | 9810 | 0 | 7957 | 0 | 1853 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 1853 | 7957 | 1474 | 8336 | 1146 | 7802 | 328 | 534 | 0.912 | 0.682 | 0.777 | 0.727 | 0.960 | 0.821
Baseline | 8517 | 1659 | 6858 | 0 | 8517 | 0 | 6858 | 0 | 1659 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 1659 | 6858 | 1391 | 7126 | 1133 | 6729 | 258 | 397 | 0.923 | 0.741 | 0.815 | 0.776 | 0.963 | 0.852
Baseline | 8050 | 1490 | 6560 | 0 | 8050 | 0 | 6560 | 0 | 1490 | 0.815 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 1490 | 6560 | 1369 | 6681 | 1076 | 6397 | 293 | 284 | 0.928 | 0.791 | 0.786 | 0.789 | 0.956 | 0.874
Baseline | 6737 | 1345 | 5392 | 0 | 6737 | 0 | 5392 | 0 | 1345 | 0.800 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 1345 | 5392 | 1228 | 5509 | 961 | 5273 | 267 | 236 | 0.925 | 0.803 | 0.783 | 0.793 | 0.952 | 0.877
Baseline | 7129 | 1344 | 5785 | 0 | 7129 | 0 | 5785 | 0 | 1344 | 0.811 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 1344 | 5785 | 1240 | 5889 | 1000 | 5663 | 240 | 226 | 0.935 | 0.816 | 0.806 | 0.811 | 0.959 | 0.888
Baseline | 7163 | 1382 | 5781 | 0 | 7163 | 0 | 5781 | 0 | 1382 | 0.807 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 1382 | 5781 | 1278 | 5885 | 1045 | 5662 | 233 | 3 | 0.936 | 0.824 | 0.818 | 0.821 | 0.960 | 0.892
Baseline | 6919 | 1347 | 5572 | 0 | 6919 | 0 | 5572 | 0 | 1347 | 0.805 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 1347 | 5572 | 1208 | 5711 | 952 | 5452 | 256 | 259 | 0.926 | 0.786 | 0.788 | 0.787 | 0.955 | 0.871
Baseline | 5204 | 1206 | 3998 | 0 | 5204 | 0 | 3998 | 0 | 1206 | 0.768 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 1206 | 3998 | 1043 | 4161 | 792 | 3925 | 251 | 236 | 0.906 | 0.770 | 0.759 | 0.765 | 0.940 | 0.855

Performance Metrics Average: Accuracy 0.922 | Recall 0.760 | Precision 0.791 | F-score 0.774 | Specificity 0.957 | BCR 0.858
Average Accuracy Baseline: 0.803 | Average BCR Baseline: 0.500
5.5.2 Prediction of Sentence Terminals (Two Class Problem)


This section presents the results of the ten experiments for the task of sentence
terminal prediction after training and testing three ML algorithms (i.e. N-gram, HMM,
and CRF) on the BAQ Corpus. Two categories of POS tags (i.e. 3 POS and 10 POS) were
selected as features for the three algorithms.


5.5.2.1 N-gram Algorithm 


The N-gram model is a probabilistic language model concerned with predicting the
next item in a sequence of observations. Prediction with the N-gram model is
performed by computing the probability of a sequence of observations and
selecting the sequence with the highest probability. The N-gram model has many
applications in the Natural Language Processing field, such as segmentation and
part-of-speech tagging. More detail about this model is found in section 3.4.1. The
N-gram model was trained and tested on the BAQ Corpus for sentence terminal
prediction. The results obtained for the experiments, using both categories of POS
tags as features, are described in the next two sections.
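As a rough illustration of count-based N-gram prediction, the sketch below labels each token with the label seen most often in the same bigram context, and backs off to the unigram context and then to the overall majority label. It is a simplified relative-frequency tagger that labels tokens one at a time, not a reproduction of the tagger used in the experiments. The class name `BackoffNgramTagger`, the bigram order, the backoff scheme, and the toy training data are all assumptions.

```python
from collections import Counter, defaultdict


class BackoffNgramTagger:
    """Label each token from bigram counts, backing off to simpler contexts."""

    def __init__(self):
        self.bigram = defaultdict(Counter)   # (previous token, token) -> label counts
        self.unigram = defaultdict(Counter)  # token -> label counts
        self.labels = Counter()              # overall label counts

    def train(self, sequences):
        for sequence in sequences:
            prev = "<s>"  # start-of-sequence marker
            for token, label in sequence:
                self.bigram[(prev, token)][label] += 1
                self.unigram[token][label] += 1
                self.labels[label] += 1
                prev = token

    def tag(self, tokens):
        if not self.labels:
            raise ValueError("tagger has not been trained")
        output, prev = [], "<s>"
        for token in tokens:
            # Try the most specific context first, then back off.
            for counts in (self.bigram.get((prev, token)),
                           self.unigram.get(token),
                           self.labels):
                if counts:
                    output.append(counts.most_common(1)[0][0])
                    break
            prev = token
        return output


# Toy training data: POS tags as tokens, labelled terminal / non.
training = [
    [("particle", "non"), ("noun", "non"), ("verb", "terminal")],
    [("noun", "non"), ("verb", "terminal")],
]
tagger = BackoffNgramTagger()
tagger.train(training)
predicted = tagger.tag(["noun", "verb"])
print(predicted)
```

Backing off keeps the tagger from failing on contexts that never occurred in training, which matters when the feature includes the word itself and many word bigrams are rare.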


5.5.2.1.1 Results of N-gram Sentence Terminal Prediction using a Three-POS Tag Set


This subsection presents the results of sentence terminal prediction after training and
testing the N-gram model using the BAQ Corpus with 3 POS tags selected as a feature
for prediction.


Table 5.5.2.1.1.1 shows the results of sentence terminal prediction using the N-gram
model for ten experiments. The N-gram model was trained and tested on the BAQ
Corpus, using the word and POS tags (i.e. 3 POS [noun, verb, and particle]) to predict
sentence terminal (i.e. the word is located at the end of the sentence) or non (i.e. the
word is not located at the end of the sentence). The average accuracy is 91.8%, an
increment of 3.0% over the average accuracy baseline of 88.8%. The average BCR is
70.0%, an increment of 20.0% over the average BCR baseline of 50.0%.


The average recall is 41.7%, the average precision is 73.5%, the average F-score is
52.8%, and the average specificity is 98.2%. Positive cases (i.e. sentence terminals)
represent less than 12% of the dataset (i.e. the BAQ Corpus), compared to more than
88% for negative cases (i.e. non). These two percentages indicate that our dataset is
skewed (i.e. imbalanced) towards non (i.e. no sentence terminal). The recall,
precision, and F-score reflect correct predictions for positive cases (i.e. how many
sentence terminals were predicted as sentence terminals). On the other hand, the
specificity reflects correct predictions for negative cases (i.e. how many non were
predicted as non). Because our dataset is imbalanced, these four metrics (i.e. recall,
precision, F-score, and specificity) yield misleading results; more details about these
four metrics are found in section 5.2. The accuracy rate and BCR metrics, by contrast,
can deal with the problem of skewed data; therefore, we will depend on the accuracy
rate and BCR metrics for comparing the results of the three ML algorithms in the
sentence terminal prediction experiments. More information about skewed data is
found in section 5.3.
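The effect of the skew on the baseline can be seen in a short calculation. The Python sketch below scores a classifier that always predicts non, using the class counts of experiment 1 (836 sentence terminals and 7687 non). The function name `majority_baseline` is hypothetical, and BCR is assumed to be the mean of recall and specificity, which is consistent with the tabulated values.

```python
def majority_baseline(n_positive, n_negative):
    """Accuracy and BCR of a classifier that always predicts the negative class."""
    tp, fn = 0, n_positive   # every positive case is missed
    tn, fp = n_negative, 0   # every negative case is trivially correct
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Assumed BCR definition: mean of recall and specificity.
    return accuracy, (recall + specificity) / 2


# Class counts of experiment 1 in the sentence terminal task.
accuracy, bcr = majority_baseline(n_positive=836, n_negative=7687)
print(f"baseline accuracy: {accuracy:.3f}")
print(f"baseline BCR:      {bcr:.3f}")
```

The baseline reaches an accuracy of about 0.902 without finding a single sentence terminal, while its BCR stays at 0.5. This is why gains over the baseline are reported for both measures.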


For each experiment we listed the positive and negative prediction values representing
the elements of the confusion matrix. TPs and TNs represent the correct predictions of
sentence terminals and non (i.e. no sentence terminal) respectively. FPs and FNs
represent the wrong (false) predictions of sentence terminals and non (i.e. no sentence
terminal) respectively. The number of TPs gives a primitive measure of the algorithm's
performance.
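The four confusion-matrix counts can be obtained by comparing the gold labels with the predicted labels position by position. The following Python sketch shows one way to do this. The function name `confusion_counts` and the label strings are illustrative assumptions.

```python
def confusion_counts(gold, predicted, positive="terminal"):
    """Count TP, TN, FP and FN for two equally long label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = tn = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive:
            if g == positive:
                tp += 1  # terminal correctly predicted
            else:
                fp += 1  # terminal predicted where there is none
        else:
            if g == positive:
                fn += 1  # terminal missed
            else:
                tn += 1  # non correctly predicted
    return tp, tn, fp, fn


gold = ["non", "non", "terminal", "non", "terminal"]
predicted = ["non", "terminal", "terminal", "non", "non"]
tp, tn, fp, fn = confusion_counts(gold, predicted)
print(tp, tn, fp, fn)
```

In this toy example each of the two errors falls into a different cell: one false positive and one false negative.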


5.5.2.1.2 Results of N-gram Sentence Terminal Prediction using a Ten-POS Tag Set


This subsection presents the results of ten experiments after training and testing the
N-gram model using the BAQ Corpus with ten POS tags selected as a feature for
prediction.


Table 5.5.2.1.2.1 shows the results of predicting sentence terminals using the N-gram
model for ten experiments. The N-gram model in these experiments uses the word and
POS tag (i.e. 10 POS [nominal, verb, pronoun, etc.]) to predict sentence terminals or
non (i.e. the word is not the end of the sentence) after each word in the tested dataset.
The average accuracy is 91.8%, an increment of 3.0% over the average accuracy
baseline of 88.8%. We also gained an increment of 19.6% over the average BCR baseline
of 50%, with an average BCR of 69.6%.


The average recall is 41%, the average precision is 74.2%, and the average
F-score is 52.3%. On the other hand, the average specificity is 98.3%. As we
explained in section 5.2, the recall, precision, F-score, and specificity yield misleading
results when the dataset is imbalanced. Therefore, we will concentrate on the accuracy
and BCR for comparing the results of the experiments conducted with different ML
algorithms.


The small difference between the results obtained with the two categories of POS tags
(i.e. 3 POS and 10 POS) as features supports the assumption that the language reader
need not possess a high level of linguistic competence to learn how to correctly
punctuate a piece of text.


Figure 5.5.2.1.1 shows a sample of text from the Qur'an where sentence terminals were
automatically added using the N-gram algorithm. In this example, the N-gram algorithm
uses 3 POS tags and the word as features for predicting the sentence terminals in the
sample. Agreements between the original Qur'an text and the automatically
punctuated text are highlighted in green. Cases of disagreement are highlighted in red.
The (*) symbol indicates a sentence terminal.


[Figure: Qur'anic sample from Surat Ad-Dukhan shown in two panels, "Original punctuated
Quran text" and "Automatic punctuation of the Quran text".]


Figure 5.5.2.1.1: Automatic sentence terminal prediction for a Qur'an text using the
N-gram model with the 3-POS tags.
Table 5.5.2.1.1.1: Sentence terminal prediction using N-gram algorithm with 3-POS tag set.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 836 | 7687 | 467 | 8056 | 354 | 7574 | 113 | 482 | 0.930 | 0.423 | 0.758 | 0.543 | 0.985 | 0.704
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 836 | 8542 | 496 | 8882 | 347 | 8393 | 149 | 489 | 0.932 | 0.415 | 0.700 | 0.521 | 0.983 | 0.699
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 836 | 8974 | 615 | 9195 | 448 | 8807 | 167 | 388 | 0.943 | 0.536 | 0.728 | 0.618 | 0.981 | 0.759
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 836 | 7681 | 527 | 7990 | 399 | 7553 | 128 | 437 | 0.934 | 0.477 | 0.757 | 0.585 | 0.983 | 0.730
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 836 | 7214 | 348 | 7702 | 233 | 7099 | 115 | 603 | 0.911 | 0.279 | 0.670 | 0.394 | 0.984 | 0.631
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 836 | 5901 | 524 | 6213 | 423 | 5800 | 101 | 413 | 0.924 | 0.506 | 0.807 | 0.622 | 0.983 | 0.744
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 836 | 6293 | 513 | 6616 | 399 | 6179 | 114 | 437 | 0.923 | 0.477 | 0.778 | 0.592 | 0.982 | 0.730
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 836 | 6327 | 497 | 6666 | 386 | 6216 | 111 | 450 | 0.922 | 0.462 | 0.777 | 0.579 | 0.982 | 0.722
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 836 | 6083 | 405 | 6514 | 288 | 5966 | 117 | 548 | 0.904 | 0.344 | 0.711 | 0.464 | 0.981 | 0.663
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 842 | 4362 | 321 | 4883 | 212 | 4253 | 109 | 630 | 0.858 | 0.252 | 0.660 | 0.365 | 0.975 | 0.613

Performance Metrics Average: Accuracy 0.918 | Recall 0.417 | Precision 0.735 | F-score 0.528 | Specificity 0.982 | BCR 0.700
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
Table 5.5.2.1.2.1: Sentence terminal prediction using N-gram algorithm with 10-POS tag set.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 836 | 7687 | 469 | 8054 | 355 | 7573 | 114 | 481 | 0.930 | 0.425 | 0.757 | 0.544 | 0.985 | 0.705
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 836 | 8542 | 470 | 8908 | 332 | 8404 | 138 | 504 | 0.932 | 0.397 | 0.706 | 0.508 | 0.984 | 0.690
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 836 | 8974 | 615 | 9195 | 448 | 8807 | 167 | 388 | 0.943 | 0.536 | 0.728 | 0.618 | 0.981 | 0.759
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 836 | 7681 | 520 | 7997 | 397 | 7558 | 123 | 439 | 0.934 | 0.475 | 0.763 | 0.586 | 0.984 | 0.729
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 836 | 7214 | 333 | 7717 | 223 | 7104 | 110 | 613 | 0.910 | 0.267 | 0.670 | 0.382 | 0.985 | 0.626
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 836 | 5901 | 513 | 6224 | 421 | 5809 | 92 | 415 | 0.925 | 0.504 | 0.821 | 0.624 | 0.984 | 0.744
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 836 | 6293 | 490 | 6639 | 391 | 6194 | 99 | 445 | 0.924 | 0.468 | 0.798 | 0.590 | 0.984 | 0.726
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 836 | 6327 | 481 | 6682 | 378 | 6224 | 103 | 458 | 0.922 | 0.452 | 0.786 | 0.574 | 0.984 | 0.718
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 836 | 6083 | 387 | 6532 | 280 | 5976 | 107 | 556 | 0.904 | 0.335 | 0.724 | 0.458 | 0.982 | 0.659
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 842 | 4362 | 298 | 4906 | 200 | 4264 | 98 | 642 | 0.858 | 0.238 | 0.671 | 0.351 | 0.978 | 0.608

Performance Metrics Average: Accuracy 0.918 | Recall 0.410 | Precision 0.742 | F-score 0.523 | Specificity 0.983 | BCR 0.696
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
5.5.2.2 HMM Algorithm


This section presents the results of training and testing the HMM model on the BAQ
Corpus over ten experiments to predict sentence terminals. Two POS tag sets (i.e. 3 POS
and 10 POS) were used as features for the HMM algorithm.


5.4.2.1 Results of HMM Sentence Terminal Prediction using a Three-POS Tag Set 


This subsection presents the results of training and testing the HMM model for ten
experiments using the BAQ Corpus with three POS tags.


Table 5.4.2.1.1 presents the results of predicting sentence terminals using the HMM
model for ten experiments. The HMM model uses the word and its POS tag (i.e. 3 POS
[noun, verb, and particle]) to predict, for each word in the test dataset, whether it is a
sentence terminal or a non-terminal (i.e. the word is not located at the end of the
sentence).


The results show that the average accuracy was 91.8%, an increment of 3.0% over the
average accuracy baseline of 88.8%, while the average BCR was 69.2%, an increment of
19.2% over the average BCR baseline of 50.0%. The average recall was 40.1%, the
average precision 75.0%, and the average f-score 51.8%. Furthermore, the average
specificity was 98.4%.
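

The reported averages are consistent with a plain, unweighted mean over the ten
experiments. The following is a minimal sketch of that check, not the code used in the
thesis: the variable and function names are our own, and the values are copied from the
Accuracy and BCR columns of Table 5.4.2.1.1.

```python
from statistics import mean

# Per-experiment scores for HMM with 3-POS tags (experiments 1 to 10),
# copied from the Accuracy and BCR columns of Table 5.4.2.1.1.
accuracy = [0.930, 0.933, 0.944, 0.934, 0.911, 0.924, 0.924, 0.921, 0.905, 0.854]
bcr = [0.702, 0.690, 0.747, 0.723, 0.629, 0.738, 0.726, 0.714, 0.658, 0.595]


def average_score(scores):
    """Unweighted mean over the experiments, rounded to three decimals."""
    return round(mean(scores), 3)


avg_accuracy = average_score(accuracy)
avg_bcr = average_score(bcr)
print(avg_accuracy, avg_bcr)
```

Because every experiment contributes equally, the smaller test sets (for example
experiment 10) weigh as much in the average as the larger ones.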


5.4.2.2 Results of HMM Sentence Terminal Prediction using a Ten-POS Tag Set 


This subsection presents the results of sentence terminal prediction after training and
testing the HMM model using the BAQ Corpus with ten POS tags selected as a feature for
prediction.


The HMM model was trained and tested using the BAQ Corpus with 10 POS tags
(nominal, verb, particle, pronoun, etc.). The HMM model uses the word and POS tags
(10 POS) for predicting sentence terminals. Table 5.5.2.2.1 shows the results of predicting
sentence terminals using the HMM model for ten experiments. The average accuracy
was 91.8%, an increment of 3.0% over the average accuracy baseline of 88.8%. The
average BCR was 69.0%, an increment of 19.0% over the average BCR baseline of
50.0%. The average recall was 39.4%, the average precision 76.1%, and the average
f-score 51.4%. On the other hand, the average specificity was 98.5%. Because the dataset
in this research is imbalanced, and the four metrics (i.e. recall, precision, f-score, and
specificity) are not reliable on imbalanced data, we concentrate on accuracy and BCR
when comparing the results of our experiments.
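

To make the role of the baseline concrete, the sketch below recomputes the six metrics
from the confusion counts of one experiment. It is an illustration under stated
assumptions rather than the thesis code: the function name is hypothetical, BCR is taken
to be the mean of recall and specificity (which agrees with the tabulated values), and
precision and f-score are set to 0 when nothing is predicted as a terminal, as in the
baseline rows. The counts come from experiment 1 of Table 5.4.2.1.1.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the six metrics of the results tables for one experiment."""
    total = tp + tn + fp + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Assumption: precision is reported as 0 when no terminal is predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
        "specificity": specificity,
        # Assumption: BCR is the mean of recall and specificity.
        "bcr": (recall + specificity) / 2,
    }


# Experiment 1, HMM with 3-POS tags, and its majority-class baseline,
# which labels every word as non-terminal.
hmm = confusion_metrics(tp=349, tn=7578, fp=109, fn=487)
baseline = confusion_metrics(tp=0, tn=7687, fp=0, fn=836)

for name, row in (("baseline", baseline), ("HMM", hmm)):
    print(name, {key: round(value, 3) for key, value in row.items()})
```

The baseline reaches an accuracy of about 0.902 without finding a single sentence
terminal, while its BCR stays at 0.500; this is why BCR is reported alongside accuracy.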


Figure 5.5.2.2.1 shows a sample of text from the Qur'an where sentence terminals were
automatically added using the HMM model with 3 POS tags. Agreements between the
original Qur'an text and the automatically punctuated text are highlighted in green;
disagreements are highlighted in red. The star (*) symbol indicates a sentence terminal.
[Figure: two panels, "Original punctuated Quran text" and "Automatic punctuation of the
Quran text", each showing Sūrat ad-Dukhān (الدخان), verses 1-13.]


Figure 5.5.2.2.1: Automatic sentence terminal prediction for a Qur'an text using HMM
model with the 3-POS tags.
Table 5.4.2.1.1: Sentence terminal prediction using HMM algorithm with 3-POS tag set.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 836 | 7687 | 458 | 8065 | 349 | 7578 | 109 | 487 | 0.930 | 0.417 | 0.762 | 0.539 | 0.986 | 0.702
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 836 | 8542 | 452 | 8926 | 330 | 8420 | 122 | 506 | 0.933 | 0.395 | 0.730 | 0.512 | 0.986 | 0.690
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 836 | 8974 | 564 | 9246 | 426 | 8836 | 138 | 410 | 0.944 | 0.510 | 0.755 | 0.609 | 0.985 | 0.747
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 836 | 7681 | 493 | 8024 | 384 | 7572 | 109 | 452 | 0.934 | 0.459 | 0.779 | 0.578 | 0.986 | 0.723
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 836 | 7214 | 337 | 7713 | 228 | 7105 | 109 | 608 | 0.911 | 0.273 | 0.677 | 0.389 | 0.985 | 0.629
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 836 | 5901 | 500 | 6237 | 411 | 5812 | 89 | 425 | 0.924 | 0.492 | 0.822 | 0.615 | 0.985 | 0.738
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 836 | 6293 | 490 | 6639 | 391 | 6194 | 99 | 445 | 0.924 | 0.468 | 0.798 | 0.590 | 0.984 | 0.726
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 836 | 6327 | 476 | 6687 | 372 | 6223 | 104 | 464 | 0.921 | 0.445 | 0.782 | 0.567 | 0.984 | 0.714
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 836 | 6083 | 375 | 6544 | 278 | 5986 | 97 | 558 | 0.905 | 0.333 | 0.741 | 0.459 | 0.984 | 0.658
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 842 | 4362 | 274 | 4930 | 179 | 4267 | 95 | 663 | 0.854 | 0.213 | 0.653 | 0.321 | 0.978 | 0.595
Performance Metrics Average: Accuracy 0.918 | Recall 0.401 | Precision 0.750 | F-score 0.518 | Specificity 0.984 | BCR 0.692
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
Table 5.5.2.2.1: Sentence terminal prediction using HMM algorithm with 10-POS tag set.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 836 | 7687 | 456 | 8067 | 350 | 7581 | 106 | 486 | 0.931 | 0.419 | 0.768 | 0.542 | 0.986 | 0.702
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 836 | 8542 | 437 | 8941 | 322 | 8427 | 115 | 514 | 0.933 | 0.385 | 0.737 | 0.506 | 0.987 | 0.686
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 836 | 8974 | 563 | 9247 | 426 | 8837 | 137 | 410 | 0.944 | 0.510 | 0.757 | 0.609 | 0.985 | 0.747
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 836 | 7681 | 481 | 8036 | 382 | 7582 | 99 | 454 | 0.935 | 0.457 | 0.794 | 0.580 | 0.987 | 0.722
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 836 | 7214 | 319 | 7731 | 219 | 7114 | 100 | 617 | 0.911 | 0.262 | 0.687 | 0.379 | 0.986 | 0.624
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 836 | 5901 | 486 | 6251 | 407 | 5822 | 79 | 429 | 0.925 | 0.487 | 0.837 | 0.616 | 0.987 | 0.737
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 836 | 6293 | 470 | 6659 | 385 | 6208 | 85 | 451 | 0.925 | 0.461 | 0.819 | 0.590 | 0.986 | 0.724
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 836 | 6327 | 465 | 6698 | 368 | 6230 | 97 | 468 | 0.921 | 0.440 | 0.791 | 0.566 | 0.985 | 0.712
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 836 | 6083 | 361 | 6558 | 271 | 5993 | 90 | 565 | 0.905 | 0.324 | 0.751 | 0.453 | 0.985 | 0.655
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 842 | 4362 | 247 | 4957 | 165 | 4280 | 82 | 677 | 0.854 | 0.196 | 0.668 | 0.303 | 0.981 | 0.589
Performance Metrics Average: Accuracy 0.918 | Recall 0.394 | Precision 0.761 | F-score 0.514 | Specificity 0.985 | BCR 0.690
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
5.5.2.3 CRF Algorithm


This section presents the results of training and testing the CRF model for sentence
terminal prediction using the BAQ Corpus. Two POS tag sets (i.e. 3 POS and 10 POS)
were selected as features for predicting sentence terminals.


5.5.2.3.1 Results of CRF Sentence Terminal Prediction using a Three-POS Tag Set 


This subsection presents the results of the experiments conducted for sentence terminal
prediction by applying the CRF model. The CRF model was trained and tested using the
BAQ Corpus, with 3 POS tags selected as a feature for prediction.


Table 5.5.2.3.1.1 shows the results of sentence terminal prediction using the CRF model
for ten experiments. The CRF model uses the words and POS tags (i.e. 3 POS) as
features for predicting, in each of the ten experiments, whether a word is a sentence
terminal (i.e. the word is located at the end of the sentence) or a non-terminal (i.e. the
word is not located at the end of the sentence). The average accuracy was 91.4%, an
increment of 2.6% over the average accuracy baseline of 88.8%. The average BCR was
64.6%, an increment of 14.6% over the average BCR baseline of 50.0%.

On the other hand, the average recall was 29.9%, the average precision 81.2%, and the
average f-score 43.1%. Furthermore, the average specificity was 99.2%. The dataset in
our experiments is imbalanced. As we explained in section 5.2, these four metrics are
misleading when the data is imbalanced. Therefore, we rely on the accuracy and BCR
metrics for comparing the conducted experiments.
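

As an illustration of what using "the words and POS tags as features" can look like in
practice, the sketch below builds one feature dictionary per token, the input shape that
common CRF toolkits accept. It is a hypothetical example and not the feature template of
this thesis: the function name, the feature names, the neighbouring-POS context window,
and the transliterated sample tokens are all our assumptions.

```python
def token_features(sentence, i):
    """Build the feature dictionary for the i-th (word, pos) pair."""
    word, pos = sentence[i]
    features = {"word": word, "pos": pos}
    # Assumed context features: the POS tag of each neighbouring token,
    # with placeholder values at the two ends of the sequence.
    features["prev_pos"] = sentence[i - 1][1] if i > 0 else "BOS"
    features["next_pos"] = sentence[i + 1][1] if i < len(sentence) - 1 else "EOS"
    return features


# Hypothetical 3-POS tagged tokens (transliterated); not taken from the BAQ Corpus.
tokens = [("wa", "particle"), ("al-kitabi", "noun"), ("al-mubin", "noun")]
feature_rows = [token_features(tokens, i) for i in range(len(tokens))]

for row in feature_rows:
    print(row)
```

With the coarse tag set each POS feature takes one of three values (noun, verb,
particle), whereas the fine tag set gives it ten possible values.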


5.5.2.3.2 Results of CRF Sentence Terminal Prediction using Ten-POS Tag Set 


This subsection presents the results of the ten experiments for sentence terminal
prediction using the CRF model.


The CRF model was trained and tested using the BAQ Corpus, with ten POS tags
selected as a feature for prediction. Table 5.5.2.3.2.1 presents the results of the
conducted experiments for sentence terminal prediction. The average accuracy was
91.4%, an increment of 2.6% over the average accuracy baseline of 88.8%. The average
BCR was 65.0%, an increment of 15.0% over the average BCR baseline of 50.0%.


In addition, the average recall was 30.9%, the average precision 80.5%, the average
f-score 44.0%, and the average specificity 99.2%.


Figure 5.5.2.3.1 shows a sample of text from the Qur'an where sentence terminals were
automatically added using the CRF model with 3 POS tags. Agreements between the
original Qur'an text and the automatically punctuated text are highlighted in green;
disagreements are highlighted in red. The star (*) symbol indicates a sentence terminal.
[Figure: two panels, "Original punctuated Quran text" and "Automatic punctuation of the
Quran text", each showing Sūrat ad-Dukhān (الدخان), verses 1-13.]


Figure 5.5.2.3.1: Automatic sentence terminal prediction for a Qur'an text using CRF
model with the 3-POS tags.
Table 5.5.2.3.1.1: Sentence terminal prediction using CRF algorithm with 3-POS tags.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 8523 | 836 | 7687 | 350 | 8173 | 298 | 7635 | 52 | 538 | 0.931 | 0.356 | 0.851 | 0.503 | 0.993 | 0.675
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9378 | 836 | 8542 | 329 | 9049 | 256 | 8469 | 73 | 580 | 0.930 | 0.306 | 0.778 | 0.439 | 0.991 | 0.649
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 9810 | 836 | 8974 | 419 | 9391 | 335 | 8890 | 84 | 501 | 0.940 | 0.401 | 0.800 | 0.534 | 0.991 | 0.696
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8517 | 836 | 7681 | 346 | 8171 | 276 | 7611 | 70 | 560 | 0.926 | 0.330 | 0.798 | 0.467 | 0.991 | 0.661
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 8050 | 836 | 7214 | 173 | 7877 | 127 | 7168 | 46 | 709 | 0.906 | 0.152 | 0.734 | 0.252 | 0.994 | 0.573
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 6737 | 836 | 5901 | 368 | 6369 | 329 | 5862 | 39 | 507 | 0.919 | 0.394 | 0.894 | 0.547 | 0.993 | 0.693
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7129 | 836 | 6293 | 343 | 6786 | 297 | 6247 | 46 | 539 | 0.918 | 0.355 | 0.866 | 0.504 | 0.993 | 0.674
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 7163 | 836 | 6327 | 328 | 6835 | 284 | 6283 | 44 | 552 | 0.917 | 0.340 | 0.866 | 0.488 | 0.993 | 0.666
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 6919 | 836 | 6083 | 258 | 6661 | 207 | 6032 | 51 | 629 | 0.902 | 0.248 | 0.802 | 0.378 | 0.992 | 0.620
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
10 | 5204 | 842 | 4362 | 130 | 5074 | 95 | 4327 | 35 | 747 | 0.850 | 0.113 | 0.731 | 0.195 | 0.992 | 0.552
Performance Metrics Average: Accuracy 0.914 | Recall 0.299 | Precision 0.812 | F-score 0.431 | Specificity 0.992 | BCR 0.646
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
Table 5.5.2.3.2.1: Sentence terminal prediction using CRF algorithm with 10-POS tags.


Exp. # | N | Gold Terminal | Gold Non-Terminal | Predicted Terminal | Predicted Non-Terminal | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 8523 | 836 | 7687 | 0 | 8523 | 0 | 7687 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
0 | 8523 | 836 | 7687 | 357 | 8166 | 303 | 7633 | 54 | 533 | 0.931 | 0.362 | 0.849 | 0.508 | 0.993 | 0.678
Baseline | 9378 | 836 | 8542 | 0 | 9378 | 0 | 8542 | 0 | 836 | 0.911 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
1 | 9378 | 836 | 8542 | 362 | 9016 | 278 | 8458 | 84 | 558 | 0.932 | 0.333 | 0.768 | 0.464 | 0.990 | 0.661
Baseline | 9810 | 836 | 8974 | 0 | 9810 | 0 | 8974 | 0 | 836 | 0.915 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
2 | 9810 | 836 | 8974 | 427 | 9383 | 339 | 8886 | 88 | 497 | 0.940 | 0.406 | 0.794 | 0.537 | 0.990 | 0.698
Baseline | 8517 | 836 | 7681 | 0 | 8517 | 0 | 7681 | 0 | 836 | 0.902 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
3 | 8517 | 836 | 7681 | 352 | 8165 | 283 | 7612 | 69 | 553 | 0.927 | 0.339 | 0.804 | 0.476 | 0.991 | 0.665
Baseline | 8050 | 836 | 7214 | 0 | 8050 | 0 | 7214 | 0 | 836 | 0.896 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
4 | 8050 | 836 | 7214 | 190 | 7860 | 140 | 7164 | 50 | 696 | 0.907 | 0.167 | 0.737 | 0.273 | 0.993 | 0.580
Baseline | 6737 | 836 | 5901 | 0 | 6737 | 0 | 5901 | 0 | 836 | 0.876 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
5 | 6737 | 836 | 5901 | 386 | 6351 | 341 | 5856 | 45 | 495 | 0.920 | 0.408 | 0.883 | 0.558 | 0.992 | 0.700
Baseline | 7129 | 836 | 6293 | 0 | 7129 | 0 | 6293 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
6 | 7129 | 836 | 6293 | 358 | 6771 | 305 | 6240 | 53 | 531 | 0.918 | 0.365 | 0.852 | 0.511 | 0.992 | 0.678
Baseline | 7163 | 836 | 6327 | 0 | 7163 | 0 | 6327 | 0 | 836 | 0.883 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
7 | 7163 | 836 | 6327 | 340 | 6823 | 295 | 6282 | 45 | 541 | 0.918 | 0.353 | 0.868 | 0.502 | 0.993 | 0.673
Baseline | 6919 | 836 | 6083 | 0 | 6919 | 0 | 6083 | 0 | 836 | 0.879 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
8 | 6919 | 836 | 6083 | 259 | 6660 | 204 | 6028 | 55 | 632 | 0.901 | 0.244 | 0.788 | 0.373 | 0.991 | 0.617
Baseline | 5204 | 842 | 4362 | 0 | 5204 | 0 | 4362 | 0 | 842 | 0.838 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
9 | 5204 | 842 | 4362 | 139 | 5065 | 99 | 4322 | 40 | 743 | 0.850 | 0.118 | 0.712 | 0.202 | 0.991 | 0.554
Performance Metrics Average: Accuracy 0.914 | Recall 0.309 | Precision 0.805 | F-score 0.440 | Specificity 0.992 | BCR 0.650
Accuracy Baseline Average: 0.888 | BCR Baseline Average: 0.500
5.5.3 Predicting Punctuation Marks in MSA Texts


This section applies the best ML algorithm, based on the previous results, to predict
punctuation marks for MSA text. The selected algorithm uses the BAQ Corpus as its
training dataset. For the purpose of the experiment, we selected MSA text from the book
معالم في الطريق "ma'ālim fī aṭ-ṭarīq" (Qutb 1979) by Sayyid Qutb. Sayyid Qutb is also
the author of في ظلال القرآن "fī ẓilāl al-qur'ān" (Qutb 1991), which we used to add
punctuation marks to the BAQ Corpus. The purpose of this experiment is to show
that we can transfer knowledge from the Quran to MSA.


The chosen text consists of 3859 words, not counting punctuation marks; each word
was manually tagged with a coarse POS tag (i.e. 3 POS). The text was tokenized into rows
such that each row consists of a word, its POS tag and its punctuation tag. The algorithm
with the best performance results for predicting punctuation marks in the BAQ Corpus
was used to predict the punctuation marks in the MSA text. The punctuated text was
then compared with the original tagged text using different metrics. More detail about
the design of the punctuation mark prediction experiment for the MSA text is found in
section 4.3.
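

The row format described above (word, POS tag, punctuation tag) can be read with a few
lines of code. The following is a minimal sketch under stated assumptions: the
whitespace separator, the `Row` type, the function name, the POS label spellings, and
the transliterated sample are all hypothetical; only the three-field layout and the
`nopunc` label come from the text.

```python
from typing import NamedTuple


class Row(NamedTuple):
    word: str
    pos: str
    punct: str


def parse_rows(text):
    """Parse one token per line: word, POS tag, punctuation tag."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between sentences
        word, pos, punct = line.split()
        rows.append(Row(word, pos, punct))
    return rows


# Hypothetical sample rows (transliterated), assuming whitespace-separated fields.
sample = """\
taqifu verb nopunc
al-bashariyyatu noun nopunc
al-yawma noun nopunc
"""

rows = parse_rows(sample)
words = [row.word for row in rows]
gold_tags = [row.punct for row in rows]
print(words, gold_tags)
```

Keeping the gold punctuation tag in its own column lets the same rows serve both as
tagger input (word and POS) and as the reference for evaluation.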


The results showed that the CRF algorithm obtained the highest performance results
using the word and 3-POS tags as features for prediction. The CRF algorithm was
therefore trained on the BAQ Corpus to predict punctuation marks for the MSA sample.


Table 5.5.3.1 presents the results of punctuation mark prediction for the MSA text
using the CRF algorithm. The CRF algorithm uses the word and POS tags (i.e. 3 POS)
to predict a punctuation mark or nopunc after each word in the MSA text. The accuracy
was 90.4%, an increment of 5.3% over the accuracy baseline of 85.1%. The BCR was
69.0%, an increment of 19.0% over the BCR baseline of 50.0%. Furthermore, the recall
was 39.7%, the precision 78.4%, the f-score 52.7%, and the specificity 98.3%.


Several conclusions are drawn from the experiment of punctuation mark prediction for
MSA text: (i) Knowledge learned using ML algorithms can be transferred from the
Qur'an text to the MSA text. (ii) The Qur'an and MSA texts share grammatical
structures that are sufficient for training and testing ML algorithms. (iii) Using 3 POS
tags (i.e. coarse annotation) as a feature indicates that this level of linguistic competence
can be enough for the ML algorithms, as well as for the language user, to predict
punctuation marks for any MSA text. Figure 5.5.3.1 presents a sample of the selected
MSA text, which was punctuated after training the CRF algorithm on the BAQ Corpus
with 3 POS tags as a feature.
Table 5.5.3.1: Results of MSA text punctuation marks prediction using CRF model.


Exp. # | N | Gold Punctuation | Gold Non-Punc | Predicted Punctuation | Predicted Non-Punc | TP | TN | FP | FN | Accuracy | Recall | Precision | F-score | Specificity | BCR
Baseline | 3859 | 574 | 3285 | 0 | 3859 | 0 | 3285 | 0 | 574 | 0.851 | 0.000 | 0.000 | 0.000 | 1.000 | 0.500
CRF | 3859 | 574 | 3285 | 264 | 3595 | 207 | 3280 | 57 | 315 | 0.904 | 0.397 | 0.784 | 0.527 | 0.983 | 0.690
[Figure: two panels, "Original MSA Text" and "Punctuated MSA Text", showing the same
passage of the selected MSA text with its original and its predicted punctuation.]


Figure 5.5.3.1: Automatic punctuation mark prediction for MSA text using the CRF
model with the 3-POS tags.
5.6 Discussion of the Results


Below is a comparison between the accuracy and BCR rates of the three ML
algorithms in the 3-POS and 10-POS sets of experiments across the two tasks of
punctuation mark prediction and sentence terminal prediction. Consider Tables 5.6.1
and 5.6.2 below:


Table 5.6.1: ML algorithms compared: Punctuation marks prediction (9-class problem).


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | 0.839 | 0.833 | 0.841 | 0.841 | 0.934 | 0.922
BCR | 0.647 | 0.650 | 0.639 | 0.637 | 0.864 | 0.858


Table 5.6.2: ML algorithms compared: Sentence terminal prediction (2-class problem).


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | 0.918 | 0.918 | 0.918 | 0.918 | 0.914 | 0.914
BCR | 0.700 | 0.696 | 0.692 | 0.690 | 0.646 | 0.650


To start with, there appears to be no significant difference in an ML algorithm's
accuracy or BCR between the 3-POS and 10-POS conditions, in either our two-class or
our nine-class problem. Notice in Tables 5.6.3 and 5.6.4 that, for N-gram and HMM, the
differences between the accuracy and BCR rates in the 3-POS experiments and their
counterparts in the 10-POS experiments range between 0 and 0.006 in absolute value.
The differences between an algorithm's performances in the two sets of experiments, in
both the 2-class and the 9-class punctuation tasks, were thus negligible. This implies
that detailed annotation is of little importance as far as these ML algorithms are
concerned.


CRF behaved similarly: there was no significant difference between its BCR rates in the
3-POS and the 10-POS sets, whether in punctuation mark prediction or in sentence
terminal prediction. Its accuracy rates for the two types of POS annotation were in fact
identical in sentence terminal prediction as well. The only discrepancy is in the accuracy
rates obtained by CRF in punctuation mark prediction in the 3-POS and 10-POS
annotation contexts, the difference between the two rates being 0.012.


Table 5.6.3: Differences between N-gram, HMM, and CRF's performances in
punctuation marks prediction in 3-POS vis-à-vis 10-POS experiments.


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | 0.006 | --- | 0 | --- | 0.012 | ---
BCR | -0.003 | --- | 0.002 | --- | 0.006 | ---


Table 5.6.4: Differences between N-gram, HMM, and CRF's performances in
sentence terminal prediction in 3-POS vis-à-vis 10-POS experiments.


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | 0 | --- | 0 | --- | 0 | ---
BCR | 0.004 | --- | 0.002 | --- | -0.004 | ---
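

The entries of Tables 5.6.3 and 5.6.4 are the 3-POS score minus the corresponding
10-POS score. The sketch below recomputes them from the averages in Tables 5.6.1 and
5.6.2; the dictionary layout and the function name are our own choices for
illustration.

```python
# Average scores as (3-POS, 10-POS) pairs, copied from Tables 5.6.1 and 5.6.2.
scores = {
    "punctuation marks": {
        "accuracy": {"N-gram": (0.839, 0.833), "HMM": (0.841, 0.841), "CRF": (0.934, 0.922)},
        "bcr": {"N-gram": (0.647, 0.650), "HMM": (0.639, 0.637), "CRF": (0.864, 0.858)},
    },
    "sentence terminals": {
        "accuracy": {"N-gram": (0.918, 0.918), "HMM": (0.918, 0.918), "CRF": (0.914, 0.914)},
        "bcr": {"N-gram": (0.700, 0.696), "HMM": (0.692, 0.690), "CRF": (0.646, 0.650)},
    },
}


def pos_set_differences(task_scores):
    """Return the 3-POS minus 10-POS difference per metric and algorithm."""
    return {
        metric: {
            algorithm: round(three_pos - ten_pos, 3)
            for algorithm, (three_pos, ten_pos) in by_algorithm.items()
        }
        for metric, by_algorithm in task_scores.items()
    }


differences = {task: pos_set_differences(s) for task, s in scores.items()}
for task, by_metric in differences.items():
    print(task, by_metric)
```

A positive value means the coarse 3-POS tag set scored higher; a negative value means the
fine 10-POS tag set scored higher.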


One might have concluded that detailed annotation causes CRF's performance to decline,
were it not that the BCR measurement does not show a comparable decline, and neither
do CRF's accuracy and BCR rates in the sentence terminal prediction experiments. We
are therefore compelled to consider the 0.012 value an anomaly that goes against the
trend established by N-gram, HMM, and CRF itself. In other words, the two types of
annotation (3-POS and 10-POS) yield similar results regardless of which ML algorithm
is used for automatic punctuation; detailed POS annotation does not improve
punctuation annotation. We conclude that users can easily punctuate a piece of Arabic
text with modest knowledge of Arabic grammar: they only need to know the main POS
of each word to accomplish the punctuation task.


Having established that there is no significant difference in the performance of the three
ML algorithms between the 3-POS and 10-POS experimental conditions, we will no
longer distinguish between the two conditions in the discussion below, although we
continue to report the results of both. We now examine whether the three ML models
differ in performance. In punctuation mark prediction, N-gram and HMM did not
perform much differently from one another. With differences of only 0.002 and 0.008 in
accuracy between them, they can be said to perform similarly. What may set them apart
is the difference in BCR: N-gram performed better than HMM by 0.008 in the 3-POS
set of experiments and by 0.013 in the 10-POS set, so one may venture to claim that
N-gram performed slightly better than HMM did. Most probably this difference is not
significant either, and it would be more accurate to say that the two algorithms are of
equal capability when it comes to annotating words with punctuation marks. CRF, on
the other hand, outperformed both. Relative to HMM, its accuracy was higher by 0.093
and 0.081 in the 3-POS and 10-POS sets of experiments respectively, and its BCR was
higher by 0.225 and 0.221 respectively. Clearly then, CRF is the best-performing
algorithm for automatic punctuation if it is desired to punctuate with commas,
semicolons, hyphens, and sentence terminals. If the punctuation of concern is
sentence-terminal only, however, N-gram and HMM are identical in accuracy and almost
identical in BCR (the difference being 0.008 in the 3-POS set of experiments and 0.006
in the 10-POS set). This confirms our conclusion above that there is no significant
difference between these two algorithms. When they are compared with CRF, they are
similar in accuracy (the difference being a mere 0.004 in favor of the two algorithms)
but superior to it in BCR, with a difference of at least 0.040.


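Tables 5.6.5 and 5.6.6 below report these differences with HMM as the reference. In the
N-gram columns, HMM's accuracy rate in the 3-POS condition is subtracted from
N-gram's corresponding accuracy rate, HMM's 3-POS BCR rate is subtracted from
N-gram's corresponding BCR rate, and so on. In the CRF columns, CRF's rate is
subtracted from HMM's corresponding rate.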


Table 5.6.5: N-gram vs. HMM vs. CRF in punctuation mark prediction.


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | -0.002 | -0.008 | --- | --- | -0.093 | -0.081
BCR | 0.008 | 0.013 | --- | --- | -0.225 | -0.221


Table 5.6.6: N-gram vs. HMM vs. CRF in sentence terminal prediction.


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | 0 | 0 | --- | --- | 0.004 | 0.004
BCR | 0.008 | 0.006 | --- | --- | 0.046 | 0.040


When the three ML algorithms' performances in punctuation mark prediction are
compared with their performances in sentence terminal prediction, as in Table 5.6.7,
N-gram's and HMM's accuracy and BCR are higher in sentence terminal prediction by
nearly 8 and 5 percentage points respectively. CRF's accuracy, however, is lower by up
to 2 percentage points, and its BCR is lower by around 21 percentage points.
Table 5.6.7: N-gram vs. HMM vs. CRF vis-à-vis the punctuation marks prediction and
sentence terminal prediction.


ML alg. | N-gram | N-gram | HMM | HMM | CRF | CRF
Metrics | 3-POS | 10-POS | 3-POS | 10-POS | 3-POS | 10-POS
Accuracy | -0.079 | -0.085 | -0.077 | -0.077 | 0.020 | 0.008
BCR | -0.053 | -0.046 | -0.053 | -0.053 | 0.218 | 0.208


CRF does better at word-level punctuation because it takes the conditional distribution
into account and focuses only on the POS labels: if this POS is followed by that POS,
then use or do not use this punctuation mark. HMM, on the other hand, does better at
sentence level for two reasons: (1) it is more focused on the joint distributions of POS
tags as they join to form sentences; (2) it takes account of the distribution over both the
word observations and their POS labels, while CRF does not. In other words, HMM
benefits from pattern sequences in two sets of data, the words and their POS tags. Table
5.6.8 shows a summary of the results for all the experiments in this research. Figures
5.6.1 and 5.6.2 show the charts of the three ML algorithms for both punctuation mark
prediction and sentence terminal prediction with the 3-POS and 10-POS tag sets.
Table 5.6.8: Summary of experiments results.


Nine/Two class problem | ML Algorithm | POS Category | Train (BAQ or MSA) | Test (BAQ or MSA) | Accuracy Baseline Average | Accuracy Score Average | BCR Baseline Average | BCR Score Average
[Figure: charts of "Average Accuracy - 3" and "Average Accuracy - 10" for N-gram, HMM
and CRF.]


Figure 5.6.1: Charts for punctuation marks prediction using ML algorithms
with 3 and 10 POS tags.


Figure 5.6.2: Charts for sentence terminal prediction using ML algorithms
with 3 and 10 POS tags.
5.6.1 Results for Predicting Individual Punctuation Marks


This section investigates the results of predicting individual punctuation marks. We
selected six punctuation marks (full stop, exclamation mark, comma, question mark,
colon and semicolon) and examined the prediction results of the three ML algorithms
with 3-POS tags. These algorithms were trained and tested using the BAQ corpus.


The results showed that the CRF algorithm again proved superior, for all six
punctuation marks. Its highest F-score, 100%, was for predicting the full stop. CRF
scored an F-score of 85% for the colon and 63% for the comma. It also scored 37%, 20%
and 10% for the exclamation mark, question mark and semicolon respectively. Table
5.6.1.1 shows the results for predicting each punctuation mark using the three ML
algorithms with 3 POS tags.


In conclusion, the CRF algorithm can predict internal punctuation marks such as the
comma better than the other models: it achieved an F-score of 63% on the commas in
the test sample.
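

A per-mark F-score treats one punctuation label at a time as the positive class. The
sketch below shows the computation on a toy pair of label sequences; the function name,
the label spellings other than `nopunc`, and the toy data are hypothetical and are not
drawn from the BAQ Corpus.

```python
def f_score_for_label(gold, predicted, label):
    """F-score for one punctuation label, treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction of this label
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy sequences: one tag per word, "nopunc" where no mark follows the word.
gold = ["nopunc", "comma", "nopunc", "fullstop", "comma", "nopunc"]
predicted = ["nopunc", "comma", "comma", "fullstop", "nopunc", "nopunc"]

comma_f = f_score_for_label(gold, predicted, "comma")
fullstop_f = f_score_for_label(gold, predicted, "fullstop")
colon_f = f_score_for_label(gold, predicted, "colon")
print(comma_f, fullstop_f, colon_f)
```

In the toy data one of the two gold commas is found and one comma is predicted wrongly,
so both precision and recall for the comma are 0.5; an F-score therefore combines the two
and is not the same as the share of gold marks recovered.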
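Table 5.6.1.1: F-score results for six punctuation marks from 3-POS
punctuation marks prediction experiment.


Punctuation mark | ML algorithm | F-score
Exclamation mark | N-gram | 0
Exclamation mark | HMM | 0
Exclamation mark | CRF | 0.37
Comma | N-gram | 0.16
Comma | HMM | 0.15
Comma | CRF | 0.63
Full stop | N-gram | 0.68
Full stop | HMM | 0.57
Full stop | CRF | 1.00
Colon | N-gram | 0.76
Colon | HMM | 0.71
Colon | CRF | 0.85
Semicolon | N-gram | 0
Semicolon | HMM | 0
Semicolon | CRF | 0.10
Question mark | N-gram | 0.07
Question mark | HMM | 0
Question mark | CRF | 0.20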
Chapter Six


Conclusions and Recommendations 


This chapter summarizes the main findings of this thesis. In addition, it describes our
plan for future work to enhance our punctuation mark prediction system for Arabic
text. Our proposal for enhancing the prediction system is (i) to use other ML algorithms
and (ii) to enrich the corpus with more features that will help ML algorithms predict
punctuation marks for Arabic text. The first section of this chapter presents the major
conclusions of our research. The second section presents our proposal for future work.


6.1 Conclusion 


This research addresses the problem of predicting punctuation marks for Arabic texts.
We showed that Arab readers have deficiencies in the usage of punctuation marks
in Arabic text (Khafaji, 2001). In addition, our research highlighted the importance of
punctuation marks for understanding Arabic text and Quran text. We also listed the
Natural Language Processing (NLP) applications where punctuation marks are useful,
such as POS tagging, phrasing, and information extraction. Finally, we reviewed the
contributions of other researchers who implemented solutions for predicting
punctuation marks in other languages.


The Holy Quran is the central text of Islam and of the Arabic language. In our research
we considered the Quran as a gold standard for our experiments. Three Machine Learning
(ML) algorithms (i.e. N-gram, HMM and CRF) were used to investigate the automatic
prediction of punctuation marks and sentence terminals for Arabic text. To achieve our
goal we used the Boundary Annotated Quran (BAQ) Corpus (Sawalha, et al., 2012) for
training and testing the three ML algorithms. We added a new tier to the BAQ corpus
containing punctuation marks and sentence terminals as used in Sayyid Qutb's Quran
exegesis في ظلال القرآن "fī ẓilāl al-qurʾān".


To test our hypothesis that ML algorithms can predict punctuation marks for Arabic
text, we designed 13 different experiments. The first set of experiments trains and tests
the three ML algorithms on the BAQ corpus for predicting punctuation marks as a
nine-class problem. The ML algorithms in this set of experiments use either 3 POS tags
and the word or 10 POS tags and the word as features for prediction. The second set of
experiments trains and tests the ML algorithms on the BAQ corpus for predicting sentence
terminals as a two-class problem. Again, the ML algorithms use either 3 POS tags and the
word or 10 POS tags and the word as features for prediction. The last experiment applies
the CRF algorithm to MSA text after it has been trained on the BAQ corpus. This
experiment was designed to test our second hypothesis, namely that knowledge learnt
from the Quran can be transferred to MSA text.
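
The two feature conditions described above can be sketched as follows. This is only an
illustration of how "the word plus a coarse tag" and "the word plus a fine-grained tag"
might be presented to a tagger: the Token record, the to_features function, the
transliterated toy words, the pairing of coarse and fine tags and the "O" label for "no
punctuation" are all assumptions of the sketch, not the actual BAQ file format.

```python
from typing import Dict, List, NamedTuple


class Token(NamedTuple):
    word: str
    pos3: str   # coarse tag: noun, verb or particle
    pos10: str  # fine-grained tag (one of ten)
    label: str  # punctuation mark after the word; "O" for none (assumed encoding)


def to_features(sentence: List[Token], tagset: str) -> List[Dict[str, str]]:
    """Build one feature dict per token for the chosen tagset ("3-POS" or "10-POS")."""
    if tagset not in ("3-POS", "10-POS"):
        raise ValueError("tagset must be '3-POS' or '10-POS'")
    use_coarse = tagset == "3-POS"
    return [
        {"word": tok.word, "pos": tok.pos3 if use_coarse else tok.pos10}
        for tok in sentence
    ]


# Toy sentence; the tag pairings are illustrative only.
sentence = [
    Token("qala", "verb", "verb", "O"),
    Token("huwa", "noun", "pronoun", ","),
    Token("fi", "particle", "particle", "O"),
]
coarse = to_features(sentence, "3-POS")
fine = to_features(sentence, "10-POS")
labels = [tok.label for tok in sentence]  # the tags the model has to predict
print(coarse[1], fine[1], labels)
```

The same label sequence is predicted in both conditions; only the granularity of the
"pos" feature changes, which is what allows the coarse and fine settings to be compared.
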


We have tried to examine which of the three machine learning algorithms is best suited
to automatic punctuation and sentence terminal prediction for Modern Standard Arabic
(MSA) texts. In addition, we have tried to answer the question of what type of
linguistic annotation is required for the best automatic punctuation performance.


The experimental results show that the CRF model has the best performance on the
punctuation mark prediction task (i.e. the nine-class problem) for both POS categories
(i.e. 3-POS tags and 10-POS tags). The CRF model scored 93.4% and 86.4% for the average
accuracy and average BCR respectively with the 3-POS tags, while it scored 92.2% and
85.8% for the average accuracy and average BCR respectively with the 10-POS tags. The
N-gram model was ranked second; it scored 83.9% and 64.7% for the average accuracy and
average BCR respectively with the 3-POS tags, while it scored 80.3% and 64.9% for the
average accuracy and average BCR respectively with the 10-POS tags. The third model was
the HMM. It scored 84.1% for the average accuracy and 63.9% for the average BCR with
the 3-POS tags, and 84.0% for the average accuracy and 63.7% for the average BCR with
the 10-POS tags.


On the other hand, the experimental results show that the N-gram model has the best
performance on the sentence terminal prediction task (i.e. the two-class problem) for
both POS categories (i.e. 3-POS tags and 10-POS tags). The N-gram model scored 91.8%
for the average accuracy and 70.0% for the average BCR with the 3-POS tags, and 91.8%
for the average accuracy and 69.6% for the average BCR with the 10-POS tags. The HMM
model was ranked second. It scored 91.8% for the average accuracy and 69.2% for the
average BCR with the 3-POS tags, and 91.8% for the average accuracy and 69.6% for the
average BCR with the 10-POS tags. Finally, the CRF model was ranked last. It scored
91.4% for the average accuracy and 64.6% for the average BCR with the 3-POS tags. For
the 10-POS tags, the CRF model scored 91.4% for the average accuracy and 65.0% for the
average BCR.
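
The gap between accuracy (around 91%) and BCR (around 65% to 70%) on the two-class task
is what one would expect when sentence terminals are much rarer than ordinary words. The
sketch below illustrates the effect on invented data. It assumes that BCR is the
unweighted mean of the per-class recall values; the function names and the toy class
proportions are illustrative and are not the figures of our experiments.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def balanced_classification_rate(gold, predicted):
    """Unweighted mean of per-class recall (assumed definition of BCR)."""
    recalls = []
    for label in sorted(set(gold)):
        total = sum(1 for g in gold if g == label)
        hits = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Toy data: 90 ordinary words ("N") and 10 sentence terminals ("T").
gold = ["N"] * 90 + ["T"] * 10
# The tagger gets every ordinary word right but finds only 4 of the 10 terminals.
predicted = ["N"] * 90 + ["T"] * 4 + ["N"] * 6

acc = accuracy(gold, predicted)
bcr = balanced_classification_rate(gold, predicted)
print(f"accuracy={acc:.2f} BCR={bcr:.2f}")
```

Because the majority class dominates the accuracy figure, a tagger that misses most
terminals can still look accurate; BCR weights the two classes equally and exposes the
difference.
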


The experimental results showed the superiority of the CRF algorithm over the other
machine learning algorithms for automatic punctuation mark prediction. This superiority
can be explained by its ability to model long-range dependencies between a sequence of
observations (words and their POS tags) and the corresponding tags (punctuation marks).
In addition, the CRF model predicted the internal punctuation marks with the fewest
errors compared with the other algorithms.
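
The difference can be made concrete by setting the standard textbook forms of the two
model families side by side. The notation here is a generic one and not necessarily the
exact parameterization of the tools used in the experiments: x is the sequence of
observations, y the sequence of punctuation tags, f_k are feature functions with
weights \lambda_k, and Z(x) is the normalizing constant.

```latex
% HMM: a joint model in which each observation depends only on its own tag
P(\mathbf{x}, \mathbf{y}) = \prod_{t=1}^{T} P(y_t \mid y_{t-1}) \, P(x_t \mid y_t)

% Linear-chain CRF: a conditional model whose features may inspect the whole input
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big)
```

In the HMM each tag sees only its own observation, whereas every CRF feature function
receives the full observation sequence x, so a feature at position t may refer to words
or POS tags far from t.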
Furthermore, the experimental results showed that the machine learning algorithms
perform slightly better when the annotation is coarse than when it is fine-grained,
apparently because patterns emerge more easily with coarse than with fine-grained
annotation. Therefore, we could say that language users need not possess a high level
of linguistic knowledge to be able to punctuate a text properly. That is, a user who
knows only simple grammar rules, or has a modest knowledge of Arabic grammar, is able
to punctuate any Arabic text.


6.2 Future Work 


Further research may be required to improve the accuracy of both punctuation mark
prediction and sentence terminal prediction. This improvement could be achieved by
adding more linguistic features to the corpus. Many linguistic features can be added,
such as grammatical annotation, parsed sentences or a simple classification of words
into content or function words.
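
The last of these features is cheap to derive. The sketch below shows one possible way
to add it, under the rough assumption that words tagged as particles are function words
and all other words are content words; this rule, the FUNCTION_POS set and the
add_word_class function are hypothetical and would need to be refined with a proper
function word list.

```python
# Coarse POS tags treated as function words (an assumption of this sketch).
FUNCTION_POS = {"particle"}


def add_word_class(features):
    """Return a copy of a token's feature dict with a content/function feature added."""
    enriched = dict(features)
    is_function = features["pos"] in FUNCTION_POS
    enriched["word_class"] = "function" if is_function else "content"
    return enriched


# Toy tokens with transliterated words, for illustration only.
particle = add_word_class({"word": "fi", "pos": "particle"})
verb = add_word_class({"word": "qala", "pos": "verb"})
print(particle["word_class"], verb["word_class"])
```
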


Another way to improve the prediction of punctuation marks and sentence terminals is
to use different Machine Learning algorithms such as the Dynamic CRF algorithm. The
Dynamic CRF algorithm can be used to predict more than one kind of label at the same
time, such as POS tags and punctuation marks.
Punctuation Prediction for Classical and Modern Arabic


Prepared by
Alaa Mohammad Al-Salman (علاء محمد الصلمان)
Supervisor
Dr. Majdi Shaker Sawalha (مجدي شاكر صوالحه)
Co-Supervisor
Prof. Dr. Hussein Mohammad Yaghi (حسين محمد ياغي)


Abstract
Arab writers find it difficult to use modern punctuation marks, and Arabic linguists
have therefore proposed that the rules for using these marks be revised. Some believe
that the use of punctuation marks must be tied to grammatical rules, but what level of
competence and knowledge must the language user possess to be able to punctuate a text?
Can machine learning algorithms handle the task of automatic punctuation of Arabic
texts? And can machine learning algorithms produce a model with which any Arabic text
can be punctuated? This research sets out to answer these questions.


Three machine learning algorithms were used in this research: Conditional Random
Fields (CRF), N-gram and Hidden Markov Model (HMM). They were trained and tested on the
Boundary Annotated Qur'an (BAQ) corpus after we had added modern punctuation marks to
it. The three algorithms were tested on the corpus using two kinds of part-of-speech
tags, fine and coarse: ten POS tags and three POS tags respectively. The results showed
that using the coarse tags (three POS tags) gives a slight improvement over using the
fine tags (ten POS tags). This indicates that the user of Arabic does not need extensive
knowledge of grammatical rules to be able to punctuate any Arabic text. Moreover, the
results showed that using the CRF machine learning algorithm for the automatic
punctuation of a modern Arabic text gives good results, which means that training on
the automatic punctuation of a Quranic corpus is beneficial for punctuating modern,
non-Quranic Arabic texts.


